Introduction


Essentially, this book is about algorithmic developments in the context of
genetic algorithms (GAs) and genetic programming (GP); we also describe
their applications to significant combinatorial optimization problems as well
as structure identification, using HeuristicLab as a platform for algorithm
development. The main aim of the theoretical considerations is to obtain a
better understanding of the basic workflow of GAs and GP, in order to
establish new bionic, problem-independent theoretical concepts and to
substantially increase the achievable solution quality.


The book is structured into a theoretical and an empirical part. The aim of 
the theoretical part is to describe the important and characteristic properties 
of the basic genetic algorithm as well as the main characteristics of the algo- 
rithmic extensions introduced here. The empirical part of the book elaborates 
two case studies: On the one hand, the traveling salesman problem (TSP) and 
the capacitated vehicle routing problem (CVRP) are used as representatives 
for GAs applied to combinatorial optimization problems. On the other hand, 
GP-based nonlinear structure identification applied to time series and clas- 
sification problems is analyzed to highlight the properties of the algorithmic 
measures in the field of genetic programming. The borderline between theory
and practice becomes blurred in some parts, as it is also necessary to describe
theoretical properties on the basis of practical examples in the first part of the
book. For this purpose we go back to some small TSP instances
that are perfectly suited for theoretical GA considerations.


Research concerning the self-adaptive interplay between selection and the 
applied solution manipulation operators (crossover and mutation) is the basis 
for the algorithmic developments presented in this book. The ultimate goal in 
this context is to avoid the disappearance of relevant building blocks and to 
support the combination of those alleles from the gene pool that carry solution 
properties of highly fit individuals. As we show in comparative test series, in
the standard variants of GAs and GP this relevant genetic information is likely
to get lost quite early and can then only be reintroduced into the
population’s gene pool by mutation. This dependence on
mutation can be drastically reduced by new generic selection principles based 
upon either self-adaptive selection pressure steering (offspring selection, OS) 
or self-adaptive population size adjustment as proposed in the relevant alleles 
preserving genetic algorithm (RAPGA). Both algorithmic extensions ensure
the survival of essential genetic information by supporting the survival of
relevant alleles rather than the survival of above-average chromosomes. This
is achieved by defining the survival probability of a new child chromosome 
depending on the child’s fitness in comparison to the fitness values of its own 
parents. With these measures it becomes possible to channel the relevant 
alleles, which are initially scattered in the entire population, to single chro- 
mosomes at the end of the genetic search process. 


The SASEGASA algorithm is a special coarse-grained parallel GA; the
acronym “SASEGASA” stands for Self-Adaptive Segregative Genetic
Algorithm including aspects of Simulated Annealing. SASEGASA combines
offspring selection with enhanced parallelization concepts in order to avoid 
premature convergence, one of the major problems with GAs. As we will 
show for the TSP, it becomes possible to scale the robustness and particularly 
the achievable solution quality by the number of subpopulations. 


Due to the high focus on sexual recombination, evolution strategies (ES) 
are not considered explicitly in this book. Nevertheless, many of the theoret- 
ical considerations are heavily inspired by evolution strategies, especially the 
aspect of selection after reproduction and (self-)adaptive selection pressure 
steering. Aside from other variants of evolutionary computation, further
inspiration is drawn from other fields such as population genetics. The
implementation of bionic ideas for algorithmic developments is quite
pragmatic and ignores debates on principles that are discussed in the natural
sciences. Of course, we are always aware of the fact that artificial evolution
as performed in an evolutionary algorithm is situated on a high level of
abstraction compared to the biological role model.


The problem-oriented part of the book is dedicated to the application of 
the algorithmic concepts described in this book to benchmark as well as real 
world problems. Concretely, we examine the traveling salesman problem and
the capacitated vehicle routing problem (which is thematically related to the
TSP but closer to actual practice) as representatives of combinatorial
optimization problems.


Time series and classification analysis are used as application areas of data- 
based structure identification with genetic programming working with for- 
mula trees representing mathematical models. As a matter of principle, we 
use standard problem representations and the appropriate problem-specific 
genetic operators known from GA and GP theory for the experiments shown 
in these chapters. The focus is set on the comparison of results achievable with 
standard GA and GP implementations to the results achieved using the ex- 
tended algorithmic concepts described in this book. These enhanced concepts 
do not depend on a concrete problem representation and its operators; their 
influences on population dynamics in GA and GP populations are analyzed, 
too. 


Additional material related to the research described in this book is
provided on the book’s homepage at http://gagp2009.heuristiclab.com.
Among other information, this website provides some of the software used
as well as dynamical presentations of representative test runs.


Chapter 1 


Simulating Evolution: Basics about 
Genetic Algorithms 


1.1 The Evolution of Evolutionary Computation 


Work on what is nowadays called evolutionary computation started in the
1960s in the United States and Germany. There have
been two basic approaches in computer science that copy evolutionary mech- 
anisms: evolution strategies (ES) and genetic algorithms (GA). Genetic al- 
gorithms go back to Holland [Hol75], an American computer scientist and 
psychologist who developed his theory not only under the aspect of solving 
optimization problems but also to study self-adaptiveness in biological pro- 
cesses. Essentially, this is the reason why genetic algorithms are much closer 
to the biological model than evolution strategies. The theoretical foundations 
of evolution strategies were formed by Rechenberg and Schwefel (see for ex- 
ample [Rec73] or [Sch94]), whose primary goal was optimization. Although 
these two concepts have many aspects in common, they developed almost in- 
dependently from each other in the USA (where GAs were developed) and 
Germany (where research was done on ES). 

Both approaches work with a population model whereby the genetic
information of each individual of a population is in general different. Among other
things this genotype includes a parameter vector which contains all necessary 
information about the properties of a certain individual. Before the intrinsic 
evolutionary process takes place, the population is initialized arbitrarily; evo- 
lution, i.e., replacement of the old generation by a new generation, proceeds 
until a certain termination criterion is fulfilled. 

The major difference between evolution strategies and genetic algorithms
lies in the representation of the genotype and in the way the operators
(mutation, selection, and possibly recombination) are used. In contrast
to GAs, where the main role of the mutation operator is simply to avoid
stagnation, mutation is the primary operator of evolution strategies.

Genetic programming (GP), an extension of the genetic algorithm, is a 
domain-independent, biologically inspired method that is able to create com- 
puter programs from a high-level problem statement. In fact, virtually all 
problems in artificial intelligence, machine learning, adaptive systems, and
automated learning can be recast as a search for a computer program; genetic 
programming provides a way to search for a computer program in the space 
of computer programs (as formulated by Koza in [Koz92a]). Similar to GAs, 
GP works by imitating aspects of natural evolution, but whereas GAs are 
intended to find arrays of characters or numbers, the goal of a GP process is 
to search for computer programs (or, for example, formulas) solving the opti- 
mization problem at hand. As in every evolutionary process, new individuals 
(in GP’s case, new programs) are created. They are tested, and the fitter ones 
in the population succeed in creating children of their own whereas unfit ones 
tend to disappear from the population. 

In the following sections we give a detailed description of the basics of 
genetic algorithms in Section 1.2, take a look at the corresponding biological 
terminology in Section 1.3, and characterize the operators used in GAs in 
Section 1.4. Then, in Section 1.5 we discuss problem representation issues, 
and in Section 1.6 we summarize the schema theory, an essentially important 
concept for understanding not only how, but also why GAs work. Parallel 
GA concepts are given in Section 1.7, and finally we discuss the interplay of 
genetic operators in Section 1.8. 


1.2 The Basics of Genetic Algorithms 


Concerning its internal functioning, a genetic algorithm is an iterative pro- 
cedure which usually operates on a population of constant size and is basically 
executed in the following way: 

An initial population of individuals (also called “solution candidates” or
“chromosomes”) is generated randomly or heuristically. During each
iteration step, also called a “generation,” the individuals of the current population
are evaluated and assigned a certain fitness value. In order to form a new pop- 
ulation, individuals are first selected (usually with a probability proportional 
to their relative fitness values), and then produce offspring candidates which 
in turn form the next generation of parents. This ensures that the expected 
number of times an individual is chosen is approximately proportional to its 
relative performance in the population. For producing new solution candi- 
dates genetic algorithms use two operators, namely crossover and mutation: 


• Crossover is the primary genetic operator: It takes two individuals,
called parents, and produces one or two new individuals, called offspring, 
by combining parts of the parents. In its simplest form, the operator 
works by swapping (exchanging) substrings before and after a randomly 
selected crossover point. 


• The second genetic operator, mutation, is essentially an arbitrary
modification which helps to prevent premature convergence by randomly
sampling new points in the search space. In the case of bit strings,
mutation is applied by simply flipping bits randomly in a string with a
certain probability called the mutation rate.


Genetic algorithms are stochastic iterative algorithms, which cannot guar- 
antee convergence; termination is hereby commonly triggered by reaching a 
maximum number of generations or by finding an acceptable solution or more 
sophisticated termination criteria indicating premature convergence. We will 
discuss this issue in further detail within Chapter 3. 

The so-called standard genetic algorithm (SGA), which represents the basis 
of almost all variants of genetic algorithms, is given in Algorithm 1.1 (which 
is formulated as in [Tom95], for example). 


Algorithm 1.1 Basic workflow of a genetic algorithm. 

Produce an initial population of individuals 

Evaluate the fitness of all individuals 

while termination condition not met do 
Select fitter individuals for reproduction and produce new individuals 
(crossover and mutation) 
Evaluate fitness of new individuals 
Generate a new population by inserting some new “good” individuals and 
by erasing some old “bad” individuals 

end while 
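
The workflow of Algorithm 1.1 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in this book: the bit-counting fitness function `one_max`, the binary tournament used for selection, the retention of the single best old individual, and all parameter values are choices made here for the sake of a runnable example.

```python
import random

def one_max(bits):
    """Illustrative fitness (an assumption): the number of 1-bits."""
    return sum(bits)

def tournament_pick(population, fitnesses, k, rng):
    """Return the fittest of k randomly drawn individuals."""
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: fitnesses[i])]

def run_ga(fitness, length=20, pop_size=30, generations=50,
           mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    # Produce an initial population of individuals and evaluate it.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    fitnesses = [fitness(ind) for ind in population]
    for _ in range(generations):  # termination: fixed generation budget
        offspring = []
        while len(offspring) < pop_size - 1:
            # Select fitter individuals and recombine them
            # (single point crossover).
            p1 = tournament_pick(population, fitnesses, 2, rng)
            p2 = tournament_pick(population, fitnesses, 2, rng)
            cut = rng.randint(1, length - 1)
            child = p1[:cut] + p2[cut:]
            # Bit flip mutation with a small probability per bit.
            child = [1 - b if rng.random() < mutation_rate else b
                     for b in child]
            offspring.append(child)
        # New population: all offspring plus the best old individual.
        best_old = max(range(pop_size), key=lambda i: fitnesses[i])
        population = offspring + [population[best_old]]
        # Evaluate the fitness of the new population.
        fitnesses = [fitness(ind) for ind in population]
    best = max(range(pop_size), key=lambda i: fitnesses[i])
    return population[best], fitnesses[best]
```

Calling `run_ga(one_max)` returns the best bit string found together with its fitness; because the best old individual is always retained in this sketch, the best fitness cannot decrease from one generation to the next.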


A special and quite restricted GA variant, which has served as the basis for
theoretical considerations for a long period of time, is given in Figure 1.1. This
chart sketches a GA with binary representation operating with generational
replacement, a population of constant size, and the following genetic
operators: roulette wheel selection, single point crossover, and bit flip mutation.
This special type of genetic algorithm, which is the basis for theoretical GA
research such as the well-known schema theorem and, accordingly, the building
block hypothesis, is also called the canonical genetic algorithm (CGA).

FIGURE 1.1: The canonical genetic algorithm with binary solution encoding.


1.3 Biological Terminology


The approximative way of solving optimization problems by genetic
algorithms holds a strong analogy to the basic principles of biological evolution.
The fundamentals of the theory of natural evolution, as it is understood today,
mainly go back to the theories of Charles Darwin, which were published
in 1859 in his well-known work “On the Origin of Species by Means of Natural
Selection, or the Preservation of Favoured Races in the Struggle for Life”
(revised edition: [Dar98]). In this work Darwin states the following five major
ideas:


• Evolution, change in lineages, occurs and occurred over time.
• All creatures have common descent.
• Natural selection determines changes in nature.
• Gradual change, i.e., nature changes somehow successively.
• Speciation, i.e., Darwin claimed that the process of natural selection
results in populations diverging enough to become separate species.


Although some of Darwin’s proposals were not new, his ideas (particularly 
those on common descent and natural selection) provided the first solid foun- 
dation upon which evolutionary biology has been built. 

At this point it may be useful to formally introduce some essential parts of 
the biological terminology which are used in the context of genetic algorithms: 


• All living organisms consist of cells containing the same set of one or
more chromosomes, i.e., strings of DNA. A gene can be understood 
as an “encoder” of a characteristic, such as eye color. The different 
possibilities for a characteristic (e.g., brown, green, blue, gray) are called 
alleles. Each gene is located at a particular position (locus) on the 
chromosome. 


• Most organisms have multiple chromosomes in each cell. The sum of all
chromosomes, i.e., the complete collection of genetic material, is called 
the genome of the organism and the term genotype refers to the partic- 
ular set of genes contained in a genome. Therefore, if two individuals 
have identical genomes, they are said to have the same genotype. 


• Organisms whose chromosomes are arranged in pairs are called diploid,
whereas organisms with unpaired chromosomes are called haploid. In 
nature, most sexually reproducing species are diploid. Humans for in- 
stance have 23 pairs of chromosomes in each somatic cell in their body. 
Recombination (crossover) occurs during sexual reproduction in the fol- 
lowing way: 


• For producing a new child, the genes of the parents are combined to
eventually form a new diploid set of chromosomes. Offspring are sub- 
ject to mutation where elementary parts of the DNA (nucleotides) are 
changed. The fitness of an organism (individual) is typically defined as 
its probability to reproduce, or as a function of the number of offspring 
the organism has produced. 


For the sake of simplification, in genetic algorithms the term chromosome
refers to a solution candidate (in the first GAs encoded as a bit string). The
genes are either single bits or small blocks of neighboring bits that encode a
particular element of the solution. In the binary case an allele is either 0 or 1;
for larger alphabets, more alleles are possible at each locus.

As a further simplification to the biological role model, crossover typically 
operates by exchanging genetic material between two haploid parents whereas 
mutation is implemented by simply flipping the bit at a randomly chosen locus. 

Finally, it is remarkable that most applications of genetic algorithms employ
haploid single-chromosome individuals, although the evolution of mankind has
inspired the GA community the most. This is most probably due to the easier
and more effective representation and implementation of single-chromosome
individuals.


1.4 Genetic Operators 


In the following, the main genetic operators, namely parent selection,
crossover, mutation, and replacement, are described. The focus lies on a
functional description of the principles rather than on a complete
overview of operator concepts; for more details about genetic operators the
interested reader is referred to textbooks such as [DLJD00].


1.4.1 Models for Parent Selection 


In genetic algorithms a fitness function assigns a score to each individual in 
a population; this fitness value indicates the quality of the solution represented 
by the individual. The fitness function is often given as part of the problem 
description or based on the objective function; developing an appropriate 
fitness function may also involve the use of simulation, heuristic techniques, or 
the knowledge of an expert. Evaluating the fitness function for each individual 
should be relatively fast due to the number of times it will be invoked. If
the evaluation is likely to be slow, then concepts of parallel and distributed
computing, an approximate function evaluation technique, or a technique
that only considers elements that have changed may be employed.

Once a population has been generated and its fitness has been measured,
the set of solutions that are selected to be “mated” in a given generation is
produced. In the standard genetic algorithm (SGA) the probability that a
chromosome of the current population is selected for reproduction is
proportional to its fitness.

In fact, there are many ways of accomplishing this selection. These include: 


• Proportional selection (roulette wheel selection):
The classical SGA utilizes this selection method, which has been
proposed in the context of Holland’s schema theorem (which will be
explained in detail in Section 1.6). Here the expected number of
descendants for an individual $i$ is given as $p_i = f_i / \bar{f}$, with
$f : S \to \mathbb{R}^+$ denoting the fitness function, $f_i$ the fitness of
individual $i$, and $\bar{f}$ representing the average fitness of all
individuals. Therefore, each individual of the population is represented by
a space on the roulette wheel proportional to its fitness. By repeatedly
spinning the wheel, individuals are chosen using random sampling with
replacement. In order to make proportional selection independent of the
dimension of the fitness values, so-called windowing techniques are usually
employed. Further variants of proportional selection aim to reduce the
dominance of a single or a group of highly fit individuals (“super
individuals”) by stochastic sampling techniques (as for example explained
in [DLJD00]).


• Linear-rank selection:

In the context of linear-rank selection the individuals of the population 
are ordered according to their fitness and copies are assigned in such a 
way that the best individual receives a pre-determined multiple of the 
number of copies the worst one receives [GB89]. On the one hand rank 
selection implicitly reduces the dominating effects of “super individuals” 
in populations (i.e., individuals that are assigned a significantly better 
fitness value than all other individuals), but on the other hand it warps 
the difference between close fitness values, thus increasing the selection 
pressure in stagnant populations. Even if linear-rank selection has been 
used with some success, it ignores the information about fitness differ- 
ences of different individuals and violates the schema theorem. 


• Tournament selection:
There are a number of variants on this theme. The most common one 
is k-tournament selection where k individuals are selected from a pop- 
ulation and the fittest individual of the k selected ones is considered 
for reproduction. In this variant selection pressure can be scaled quite 
easily by choosing an appropriate number for k. 
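
The three selection models above can be illustrated with a short Python sketch. It is meant as a hedged example only: the function names, the `ratio` parameter (how many times more weight the best individual receives than the worst), and the assumption of strictly positive fitness values to be maximized are choices made here and are not taken from the cited literature.

```python
import random

def roulette_wheel(fitnesses, rng):
    """Proportional selection: index i is drawn with probability
    f_i / sum(f). Assumes strictly positive fitness values."""
    total = sum(fitnesses)
    spin = rng.random() * total
    running = 0.0
    for i, f in enumerate(fitnesses):
        running += f
        if spin < running:
            return i
    return len(fitnesses) - 1  # guard against floating point round-off

def linear_rank_weights(fitnesses, ratio=2.0):
    """Rank-based weights: the best individual gets `ratio` times the
    weight of the worst, with linear interpolation in between."""
    n = len(fitnesses)
    order = sorted(range(n), key=lambda i: fitnesses[i])  # worst first
    weights = [0.0] * n
    for rank, i in enumerate(order):
        step = rank / (n - 1) if n > 1 else 0.0
        weights[i] = 1.0 + (ratio - 1.0) * step
    return weights

def tournament(fitnesses, k, rng):
    """k-tournament: draw k distinct individuals, return the fittest."""
    contenders = rng.sample(range(len(fitnesses)), k)
    return max(contenders, key=lambda i: fitnesses[i])
```

Linear-rank selection can then be realized by spinning the wheel over the rank weights instead of the raw fitness values, e.g. `roulette_wheel(linear_rank_weights(fitnesses), rng)`, which shows why rank selection ignores the size of the fitness differences. For tournament selection, a larger `k` increases the selection pressure.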


1.4.2 Recombination (Crossover) 


In its easiest formulation, which is suggested in the canonical GA for binary 
encoding, crossover takes two individuals and cuts their chromosome strings 
at some randomly chosen position. The produced substrings are then swapped 
to produce two new full length chromosomes. 

Conventional crossover techniques for binary representation include: 


• Single point crossover:
A single random cut is made, producing two head sections and two 
tail sections. The two tail sections are then swapped to produce two 
new individuals (chromosomes); Figure 1.2 schematically sketches this 
crossover method which is also called one point crossover. 


FIGURE 1.2: Schematic display of a single point crossover.


• Multiple point crossover:
One natural extension of the single point crossover is the multiple point
crossover: In an n-point crossover there are n crossover points and
substrings are swapped between the n points. According to some
researchers, multiple-point crossover is more suitable to combine good
features present in strings because it samples uniformly along the full length
of a chromosome [Ree95]. At the same time, multiple-point crossover
becomes more and more disruptive with an increasing number of crossover
points, i.e., the evolvement of longer building blocks becomes more and
more difficult. Decreasing the number of crossover points during the
run of the GA may be a good compromise.


• Uniform crossover:

Given two parents, each gene in the offspring is created by copying 
the corresponding gene from one of the parents. The selection of the 
corresponding parent is undertaken via a randomly generated crossover 
mask: At each index, the offspring gene is taken from the first parent 
if there is a 1 in the mask at this index, and otherwise (if there is a 0 
in the mask at this index) the gene is taken from the second parent. 
Due to this construction principle uniform crossover does not support 
the evolvement of higher order building blocks. 
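
As an illustration of the three crossover variants for binary representation described above, the following Python sketch may be helpful. The function names and the use of Python lists as chromosomes are assumptions made here; the n-point variant assumes that n is smaller than the chromosome length.

```python
import random

def single_point(p1, p2, rng):
    """Cut both parents at one random position and swap the tails."""
    cut = rng.randint(1, len(p1) - 1)
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def n_point(p1, p2, n, rng):
    """Choose n distinct cut points and swap every second segment."""
    cuts = sorted(rng.sample(range(1, len(p1)), n))
    c1, c2 = list(p1), list(p2)
    swap = False
    prev = 0
    for cut in cuts + [len(p1)]:
        if swap:
            c1[prev:cut], c2[prev:cut] = p2[prev:cut], p1[prev:cut]
        swap = not swap
        prev = cut
    return c1, c2

def uniform(p1, p2, rng):
    """Take each gene from the first parent where the random mask is 1
    and from the second parent where it is 0."""
    mask = [rng.randint(0, 1) for _ in p1]
    return [a if m == 1 else b for a, b, m in zip(p1, p2, mask)]
```

With `n = 1` the n-point variant reduces to single point crossover. Crossing an all-zero parent with an all-one parent makes the effect visible: single point crossover yields one switch between zeros and ones, n-point crossover yields n switches, and uniform crossover scatters the genes of both parents over the whole child.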


The choice of an appropriate crossover operator depends very much on the
representation of the search space (see also Section 1.5). Sequencing problems
such as routing problems, for example, often require operators different from
the ones described above, as otherwise almost all generated children may be
situated outside of the space of valid solutions.

In higher order representations, a variety of real-number combination op- 
erators can be employed, such as the average and geometric mean. Domain 
knowledge can be used to design local improvement operators which some- 
times allow more efficient exploration of the search space around good solu- 
tions. For instance, knowledge could be used to determine the appropriate 
locations for crossover points. 

As the number of proposed problem-specific crossover techniques has
grown so much over the years, it would go beyond the scope of the present
book to discuss even the more important ones. For a good discussion of
crossover-related issues and further references the reader is referred to [Mic92]
and [DLJD00].


1.4.3 Mutation


Mutations allow undirected jumps to slightly different areas of the search 
space. The basic mutation operator for binary coded problems is bitwise 
mutation. Mutation occurs randomly and very rarely with a probability $p_m$;
typically, this mutation rate is less than ten percent. In some cases mutation 
is interpreted as generating a new bit and in others it is interpreted as flipping 
the bit. 

In higher order alphabets, such as integer numbering formulations, muta- 
tion takes the form of replacing an allele with a randomly chosen value in the 
appropriate range with probability $p_m$. However, for combinatorial
optimization problems, such mutation schemes can cause difficulties with chromosome
legality; for example, multiple copies of a given value can occur which might 
be illegal for some problems (including routing). Alternatives suggested in 
literature include pairwise swap and shift operations as for instance described 
in [Car94]. 
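
The mutation schemes mentioned so far can be sketched as follows. This is a minimal illustration under assumptions made here: the function names are hypothetical, `p_m` is applied independently to each gene, and the swap operation is a generic pairwise exchange rather than the specific operator of [Car94].

```python
import random

def bit_flip(bits, p_m, rng):
    """Bitwise mutation: flip each bit independently with probability p_m."""
    return [1 - b if rng.random() < p_m else b for b in bits]

def random_reset(values, low, high, p_m, rng):
    """Integer alphabets: replace an allele by a random value from
    [low, high] with probability p_m (may create duplicate values)."""
    return [rng.randint(low, high) if rng.random() < p_m else v
            for v in values]

def swap_mutation(tour, rng):
    """Pairwise swap: exchange two positions, so a permutation stays
    a permutation and no value is duplicated."""
    i, j = rng.sample(range(len(tour)), 2)
    mutated = list(tour)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated
```

The difference in legality is the point of the comparison: `random_reset` applied to a tour can produce a city twice, whereas `swap_mutation` always returns a rearrangement of the same cities.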

In addition, adaptive mutation schemes similar to mutation in the context 
of evolution strategies are worth mentioning. Adaptive mutation schemes 
vary either the rate, or the form of mutation, or both during a GA run. For 
instance, mutation is sometimes defined in such a way that the search space 
is explored uniformly at first and more locally towards the end, in order to do 
a kind of local improvement of candidate solutions [Mic92]. 


1.4.4 Replacement Schemes 


After having generated a new generation of descendants (offspring) by
crossover and mutation, the question arises which of the new candidates should
become members of the next generation. In the context of evolution strategies
this decision determines the life span of the individuals and substantially
influences the convergence behavior of the algorithm. A further strategy that
influences replacement quite drastically is offspring selection, which will be
discussed separately in Chapter 4. The following schemes are possible
replacement mechanisms for genetic algorithms:


• Generational Replacement:
The entire population is replaced by its descendants. Similar to the
(μ, λ) evolution strategy it might therefore happen that the fitness of
the best individual decreases at some stage of evolution. Additionally,
this strategy limits the dominance of a few individuals,
which might help to avoid premature convergence [SHF94].


• Elitism:
The best individual (or the n best individuals, respectively) of the
previous generation is retained for the next generation, which theoretically
allows immortality similar to the (μ + λ) evolution strategy and might
be critical with respect to premature convergence. The special and
commonly applied strategy of just retaining one (the best) individual of the
last generation is also called the “golden cage model,” which is a special
case of n-elitism with n = 1. If mutation is applied to the elite in order
to prevent premature convergence, the replacement mechanism is called
“weak elitism.”


• Delete-n-last:
The n weakest individuals are replaced by n descendants. If n ≪ |POP|
we speak of a steady-state replacement scheme; for n = 1 the changes
between the old and the new generation are certainly very small and n =
|POP| gives the already introduced generational replacement strategy.


• Delete-n:
In contrast to the delete-n-last replacement strategy, here not the n 
weakest but rather n arbitrarily chosen individuals of the old generation 
are replaced, which on the one hand reduces the convergence speed of 
the algorithm but on the other hand also helps to avoid premature 
convergence (compare elitism versus weak elitism). 


• Tournament Replacement:
Competitions are run between sets of individuals from the last and the
current generation, with the winners becoming part of the new population.
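
Some of the replacement schemes listed above can be written down in a few lines of Python. The following is only a sketch under assumptions made here: fitness is maximized, the population size stays constant, as many offspring as parents are available, and in the elitism variant the weakest offspring are the ones that make room for the elite.

```python
def generational(old, offspring, fitness):
    """The entire population is replaced by its descendants."""
    return list(offspring)

def elitism(old, offspring, fitness, n=1):
    """Retain the n best old individuals; fill up with the best offspring."""
    elite = sorted(old, key=fitness, reverse=True)[:n]
    survivors = sorted(offspring, key=fitness, reverse=True)[:len(old) - n]
    return elite + survivors

def delete_n_last(old, offspring, fitness, n):
    """Replace the n weakest old individuals by n descendants."""
    kept = sorted(old, key=fitness, reverse=True)[:len(old) - n]
    return kept + list(offspring[:n])
```

With `n` equal to the population size, `delete_n_last` keeps no old individual and therefore coincides with generational replacement, as noted in the description of delete-n-last.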


A detailed description of replacement schemes and their effects can be found 
for example in [SHF94], [Mic92], [DLJD00], and [Mit96]. 


1.5 Problem Representation 


As already stated before, the first genetic algorithm presented in literature 
[Hol75] used binary vectors for the representation of solution candidates (chro- 
mosomes). Consequently, the first solution manipulation operators (single 
point crossover, bit mutation) have been developed for binary representation. 
Furthermore, this very simple GA, also commonly known as the canonical
genetic algorithm (CGA), represents the basis for extensive theoretical
investigations, resulting in the well-known schema theorem and the building
block hypothesis ([Hol75], [Gol89]). This background theory will be examined
separately in Section 1.6, as it defines the scope of almost any GA as it should
ideally be and distinguishes GAs from almost any other heuristic optimization
technique.

The distinguishing feature of GAs is that they assemble so-called building
blocks, i.e., linked parts of the chromosome that are advantageous with respect
to the given fitness function and become larger as the algorithm proceeds.
In other words, one could define the claim of a GA as being an algorithm which
is able to assemble the basic modules of highly fit or even globally optimal
solutions (which the algorithm of course does not know about). These basic
modules are with some probability already available in the initial population,
but spread over many individuals; the algorithm therefore has to combine
these modules in such a clever way that continuously growing sequences of
high-quality alleles, the so-called building blocks, are formed.

Compared to heuristic optimization techniques based on neighborhood
search (such as tabu search [Glo86] or simulated annealing [KGV83]), the
methodology of GAs to combine partial solutions (by crossover) is
potentially much more robust with respect to getting stuck in locally but not
globally optimal solutions; this tendency of neighborhood-based searches is
a major drawback of these heuristics. Still, when applying GAs the
user has to pay much more attention to the problem representation in order
to help the algorithm fulfill the claim stated above. In that sense the
problem representation must allow the solution manipulation operators,
especially crossover, to combine alleles of different parent individuals. This is
because crossover is responsible for combining the properties of two solution
candidates, which may be located in very different regions of the search space,
so that valid new solution candidates are built. This is why the problem
representation has to be designed in a way that crossover operators are able to
build valid new children (solution candidates) whose genetic makeup is
taken from the union of the alleles of their parents.

Furthermore, in keeping with the general functioning of GAs, the crossover
operators also have to support the potential development of higher-order
building blocks (longer allele sequences). Only if the genetic operators for
a certain problem representation show these necessary properties can the
corresponding GA be expected to work as it should, i.e.,
in the sense of a generalized interpretation of the building block hypothesis.

Unfortunately, many more or less established problem representations are
not able to fulfill these requirements, as they do not support the design of
potentially suited crossover operators. Some problem representations are
considered below as examples, with particular attention to their ability to
allow meaningful crossover procedures. Even if mutation, the second solution
manipulation concept of GAs, is also of essential importance, the design of
meaningful mutation operators is much less challenging, as it is a lot easier
to fulfill the requirements of a suited mutation operator (which in fact is to
introduce a small amount of new genetic information).


1.5.1 Binary Representation 


In the early years of GA research there was a strong focus on binary encod- 
ing of solution candidates. To some extent, an outgrowth of these ambitions 
is certainly the binary representation for the TSP. There have been different
ways to use binary representation for the TSP, the most straightforward
one being to encode each city as a string of $\log_2 n$ bits and a solution
candidate as a string of $n \log_2 n$ bits. Crossover is then simply performed
by applying
single-point crossover as proposed by Holland [Hol75]. Further attempts using
binary encoding have been proposed using binary matrix representation
([FM91], [HGL93]). In [HGL93], Homaifar and Guan, for example, defined a
matrix element in the i-th row and the j-th column to be 1 if and only if in the
tour city j is visited after city i; they also applied one- or two-point
crossover on the parent matrices, which for one-point crossover means that the
child tour is created by just taking the column vectors left of the crossover
point from one parent, and the column vectors right of the crossover point
from the other parent.
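
The following minimal sketch illustrates why column-wise crossover on such
matrices tends to produce illegal tours. It is an illustration only, not the
implementation from [HGL93]: it assumes that "visited after" means "visited
directly after", and the function names tour_to_matrix,
one_point_column_crossover, and is_valid_tour_matrix are hypothetical.

```python
def tour_to_matrix(tour):
    """Binary matrix with m[i][j] == 1 iff city j directly follows city i (assumption)."""
    n = len(tour)
    matrix = [[0] * n for _ in range(n)]
    for pos, city in enumerate(tour):
        matrix[city][tour[(pos + 1) % n]] = 1
    return matrix


def one_point_column_crossover(m1, m2, point):
    """Take the columns left of `point` from m1 and the remaining columns from m2."""
    return [row1[:point] + row2[point:] for row1, row2 in zip(m1, m2)]


def is_valid_tour_matrix(matrix):
    """A legal tour needs one successor and one predecessor per city and a single cycle."""
    n = len(matrix)
    if any(sum(row) != 1 for row in matrix):
        return False
    if any(sum(matrix[i][j] for i in range(n)) != 1 for j in range(n)):
        return False
    city, seen = 0, set()
    while city not in seen:
        seen.add(city)
        city = matrix[city].index(1)
    return len(seen) == n


if __name__ == "__main__":
    parent1 = tour_to_matrix([0, 1, 2, 3, 4])
    parent2 = tour_to_matrix([0, 2, 4, 1, 3])
    child = one_point_column_crossover(parent1, parent2, 2)
    print(is_valid_tour_matrix(child))
```

In this small example city 0 ends up with two successors and city 3 with
none, so the child would have to be repaired before it can be evaluated.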


Obviously, these strategies lead to highly illegal tours which are then
repaired by additional repair strategies [HGL93], which is exactly the point
where a GA can no longer act as it is supposed to. As the repair strategies
have to introduce a large amount of genetic information which stems from
neither parent, child solutions emerge whose genetic make-up has only little
in common with that of their own parents; this counteracts the general
functioning of GAs as given in a more general interpretation of the schema
theorem and the associated building block hypothesis.


1.5.2 Adjacency Representation 


Using the adjacency representation for the TSP (as described in [LKM+99],
e.g.), a city j is listed in position i if and only if the tour leads from city i to 
city j. Based on the adjacency representation, the so-called alternating edges 
crossover has been proposed for example which basically works as follows: 
First it chooses an edge from one parent and continues with the position of 
this edge in the other parent representing the next edge, etc. The partial 
tour is built up by choosing edges from the two parents alternatingly. In case 
this strategy would produce a cycle, the edge is not added, but instead the 
operator randomly selects an edge from the edges which do not produce a 
cycle and continues in the way described above. 
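
A minimal sketch of the adjacency representation and of an alternating edges
crossover along the lines described above may help to make the procedure
concrete. It is one possible reading of the verbal description, not a
reference implementation; the function names and the choice of city 0 as the
starting city are assumptions made for illustration.

```python
import random


def path_to_adjacency(path):
    """adj[i] == j iff the tour leads from city i to city j."""
    n = len(path)
    adj = [0] * n
    for pos, city in enumerate(path):
        adj[city] = path[(pos + 1) % n]
    return adj


def adjacency_to_path(adj, start=0):
    """Follow the successor entries to recover the visiting order."""
    path, city = [], start
    for _ in range(len(adj)):
        path.append(city)
        city = adj[city]
    return path


def alternating_edges_crossover(adj1, adj2, rng=random):
    """Build a child by taking edges alternately from the two parents."""
    n = len(adj1)
    parents = (adj1, adj2)
    child = [None] * n
    current, visited = 0, {0}
    for step in range(n - 1):
        nxt = parents[step % 2][current]
        if nxt in visited:
            # The parental edge would close a cycle too early, so a random
            # edge to a city that has not been visited yet is used instead.
            nxt = rng.choice([c for c in range(n) if c not in visited])
        child[current] = nxt
        visited.add(nxt)
        current = nxt
    child[current] = 0  # close the tour
    return child


if __name__ == "__main__":
    a = path_to_adjacency([0, 1, 2, 3, 4, 5, 6, 7])
    b = path_to_adjacency([0, 3, 6, 1, 4, 7, 2, 5])
    print(adjacency_to_path(alternating_edges_crossover(a, b)))
```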


Compared to the crossover operators based on binary encoding, this strategy
has the obvious advantage that a new child is built up from edges of its
own parents. However, this strategy is not very well suited either, as a
further requirement on crossover is not fulfilled at all: The alternating
edges crossover cannot inherit longer tour segments, and therefore longer
building blocks cannot become established. As a further development of the
alternating edges crossover, the so-called sub-tour chunks crossover aims to
put things right by alternating not single edges but sub-tours of the two
parental solutions. However, the capabilities of this strategy are also
rather limited.
1.5.3 Path Representation


The most natural representation of a TSP tour is given by the path
representation. Within this representation, the n cities of a tour are put in
order according to a list of length n, so that the order of cities to be
visited is given by the list entries with an imaginary edge from the last to
the first list entry. A lot of crossover and mutation operators have been
developed based upon this representation, and most of the GA-based TSP
solution methods in use today are realized using path representation. Despite
obvious disadvantages like the ambiguity of this representation (the same
tour can be described in 2n different ways for a symmetrical TSP and in n
different ways for an asymmetrical TSP), this representation has allowed the
design of quite powerful operators like the order crossover (OX) or the edge
recombination crossover (ERX). These operators are able to pass parent
sub-tours on to child solutions with only a rather small ratio of edges
stemming from neither parent, which is essential for GAs. A detailed
description of these operators is given in Chapter 8.
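
The ambiguity of the path representation can be made tangible with a short
sketch. The helper names equivalent_paths and canonical_tour are hypothetical;
the code merely enumerates the rotations (and, for the symmetrical case, the
reversals) of a path and maps them all to one canonical list.

```python
def equivalent_paths(path, symmetric=True):
    """All lists that describe the same tour as `path`."""
    n = len(path)
    paths = [path[i:] + path[:i] for i in range(n)]
    if symmetric:
        # In a symmetrical TSP the direction of travel does not matter.
        paths += [p[::-1] for p in paths]
    return paths


def canonical_tour(path, symmetric=True):
    """Rotate the path so that it starts with the smallest city label;
    for symmetrical problems also pick the smaller of the two directions."""
    start = path.index(min(path))
    rotated = path[start:] + path[:start]
    if symmetric:
        reversed_tour = [rotated[0]] + rotated[:0:-1]
        rotated = min(rotated, reversed_tour)
    return rotated


if __name__ == "__main__":
    tour = [2, 0, 3, 1, 4]
    print(len(equivalent_paths(tour)))         # number of symmetric variants
    print(len(equivalent_paths(tour, False)))  # number of asymmetric variants
    print(canonical_tour(tour))
```

Such a canonical form is only a convenience for comparing tours; the
operators themselves work on any of the equivalent lists.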


1.5.4 Other Representations for Combinatorial 
Optimization Problems 


Combinatorial optimization problems that are more in step with actual 
practice than the TSP require more complex problem representations, which 
makes it even more difficult for the designer of genetic solution manipulation 
operators to construct crossover operators that fulfill the essential require- 
ments. 


Challenging optimization tasks arise in the field of logistics and production 
planning optimization where the capacitated vehicle routing problem with 
(CVRPTW, [Tha95]) and without time windows (CVRP, [DR59]) as well as 
the job shop scheduling problem (JSSP [Tai93]) denote abstracted standard 
formulations which are used for the comparison of optimization techniques on 
the basis of widely available standardized benchmark problems. Tabu search 
[Glo86] and genetic algorithms are considered the most powerful optimiza- 
tion heuristics for these rather practical combinatorial optimization problems 
[BR03]. 

Cheng et al. as well as Yamada and Nakano give a comprehensive review 
of problem representations and corresponding operators for applying Genetic 
Algorithms to the JSSP in [CGT99] and [YN97], respectively. 


For the CVRP, Bräysy and Gendreau give a detailed overview about the 
application of local search algorithms in [BG05a] and about the application 
of metaheuristics in [BG05b]; concrete problem representations and crossover 
operators for GAs are outlined in [PB96] and [Pri04]. Furthermore, the appli- 
cation of extended GA concepts to the CVRP will be covered in the practical 
part of this book within Chapter 10. 
1.5.5 Problem Representations for Real-Valued Encoding


When using real-valued encoding, a solution candidate is represented as 
a real-valued vector in which the dimension of the chromosomes is constant 
and equal to the dimension of the solution vectors. Crossover concepts are 
distinguished into discrete and continuous recombination where the discrete 
variants copy the exact allele values of the parent chromosomes to the child 
chromosome whereas the continuous variants perform some kind of averaging. 

Mutation operators for real-valued encoding either slightly modify all po- 
sitions of the gene or introduce major changes to only some (often just one) 
position. Often a mixture of different crossover and mutation techniques leads 
to the best results for real-valued GAs. A comprehensive review of crossover 
and mutation techniques including also more sophisticated techniques like 
multi-parent recombination is given in [DLJD00]. 
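
The distinction between discrete and continuous recombination, and between
the two kinds of mutation, can be sketched as follows. The operators shown
are simple textbook-style variants chosen for illustration; the function
names, the Gaussian perturbation, and the default parameter values are
assumptions and are not taken from [DLJD00].

```python
import random


def discrete_recombination(p1, p2, rng=random):
    """Copy each allele unchanged from one of the two parents."""
    return [rng.choice(pair) for pair in zip(p1, p2)]


def intermediate_recombination(p1, p2, weight=0.5):
    """Continuous variant: weighted average of the parental alleles."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(p1, p2)]


def gaussian_mutation(x, sigma=0.1, rng=random):
    """Slightly modify all positions of the vector."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]


def single_position_mutation(x, low, high, rng=random):
    """Introduce a major change at exactly one randomly chosen position."""
    child = list(x)
    child[rng.randrange(len(child))] = rng.uniform(low, high)
    return child


if __name__ == "__main__":
    a, b = [0.0, 2.0, 4.0], [2.0, 4.0, 8.0]
    print(discrete_recombination(a, b))
    print(intermediate_recombination(a, b))
    print(gaussian_mutation(a))
    print(single_position_mutation(a, -10.0, 10.0))
```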

Although real-valued encoding is a problem representation which is
especially suited for evolution strategies (ES) or particle swarm
optimization rather than for GAs, a lot of operators have also been
established for GAs which are quite similar to modern implementations of ES
that make use of recombination [Bey01]. Real-valued encoding for GAs differs
from typical discrete representations for combinatorial optimization problems
in that the evolution of longer and longer building block sequences in terms
of adjacent alleles is of minor or no importance. Nevertheless, GA-based
techniques like offspring selection have proven to be very powerful
optimization techniques also for this kind of problem representation,
especially in the case of highly multimodal fitness landscapes [AW05].


1.6 GA Theory: Schemata and Building Blocks 


Researchers working in the field of GAs have put a lot of effort into the 
analysis of the genetic operators (crossover, mutation, selection). In order to 
achieve better analysis and understanding, Holland has introduced a construct 
called schema [Hol75]: 

Under the assumption of a canonical GA with binary string representation
of individuals, the symbol alphabet {0,1,#} is considered, where # (don’t
care) is a special wild card symbol that matches both 0 and 1.
A schema is a string with fixed and variable symbols. For example, the schema
[0#11#01] is a template that matches the following four strings: [0011001],
[0011101], [0111001], and [0111101]. The symbol # is never actually
manipulated by the genetic algorithm; it is just a notational device that
makes it easier to talk about families of strings.
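
As a small illustration of this notation, the following sketch checks whether
a bit string is an instance of a schema and enumerates all instances of a
schema. Schemata are written as Python strings over the alphabet {0,1,#}; the
function names are chosen for this example only.

```python
from itertools import product


def matches(schema, string):
    """True if `string` is an instance of `schema` ('#' matches 0 and 1)."""
    return len(schema) == len(string) and all(
        s in ("#", c) for s, c in zip(schema, string)
    )


def instances(schema):
    """All bit strings matched by `schema`."""
    options = ["01" if s == "#" else s for s in schema]
    return ["".join(bits) for bits in product(*options)]


if __name__ == "__main__":
    print(instances("0#11#01"))
    print(matches("0#11#01", "0111001"))
```

A schema with k wild card symbols matches 2^k strings, which is why the
example schema with two wild cards has four instances.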

Essentially, Holland’s idea was that every evaluated string actually gives
partial information about the fitness of the set of possible schemata of which
the string is a member. Holland analyzed the influence of selection, crossover,
and mutation on the expected number of schemata, when going from one
generation to the next. A detailed discussion of related analysis can be found
in [Gol89]; in the context of the present work we only outline the main results
and their significance.

Assuming fitness proportional replication, the number $m$ of individuals of
the population belonging to a particular schema $H$ at time $t+1$ is related
to the same number at time $t$ as

$$ m(H,t+1) = m(H,t)\,\frac{f_H(t)}{\bar{f}(t)} \tag{1.1} $$

where $f_H(t)$ is the average fitness value of the strings representing
schema $H$, while $\bar{f}(t)$ is the average fitness value over all strings
within the population.
Assuming that a particular schema remains above the average by a fixed
amount $c\bar{f}(t)$ for a number $t$ of generations, the solution of the
equation given above can be formulated as the following exponential growth
equation:

$$ m(H,t) = m(H,0)\,(1+c)^t \tag{1.2} $$

where $m(H,0)$ stands for the number of individuals belonging to schema $H$
in the population at time 0, $c$ denotes a positive constant, and $t \geq 0$.
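
The step from the reproduction equation (1.1) to the growth equation (1.2)
can be spelled out briefly. Writing $f_H(t)$ for the average fitness of the
schema and $\bar{f}(t)$ for the population average, the assumption that the
schema stays above average by $c\bar{f}(t)$ gives a constant growth factor
per generation:

```latex
f_H(t) = \bar{f}(t) + c\,\bar{f}(t) = (1+c)\,\bar{f}(t)
\;\Longrightarrow\;
m(H,t+1) = m(H,t)\,\frac{(1+c)\,\bar{f}(t)}{\bar{f}(t)} = (1+c)\,m(H,t)
\;\Longrightarrow\;
m(H,t) = (1+c)^{t}\,m(H,0).
```

For example, a schema with 5 instances that stays 10% above average
($c = 0.1$) would be expected to have about $5 \cdot 1.1^{10} \approx 13$
instances after 10 generations, selection being the only effect considered.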

The importance of this result is the exponentially increasing number of 
trials to above average schemata. 

The effect of crossover, which breaks strings apart (at least in the case of
canonical genetic algorithms), is that it reduces the exponential increase by
a quantity that is proportional to the crossover rate $p_c$ and depends on the
defining length $\delta$ of a schema on the string of length $l$:

$$ p_c\,\frac{\delta(H)}{l-1} \tag{1.3} $$


The defining length $\delta$ of a schema is the distance between the first
and the last fixed string position. For example, for the schema [###0#0101]
$\delta = 9 - 4 = 5$. Obviously, short defining length schemata are less
likely to be disrupted by a single point crossover operator. The main result
is that above average schemata with short defining lengths will still be
sampled at an exponentially increasing rate. These schemata with above average
fitness and short defining length are the so-called building blocks and play
an important role in the theory of genetic algorithms.

The effects of mutation are described in a rather straightforward way: If
the bit mutation probability is $p_m$, then the probability of survival of a
single bit is $1 - p_m$; since single bit mutations are independent, the total
survival probability is therefore $(1 - p_m)^l$ with $l$ denoting the string
length. But in the context of schemata only the fixed, i.e., non-wildcard,
positions matter. Their number is called the order $o(H)$ of schema $H$ and
equals $l$ minus the number of “don’t care” symbols. Then the probability of
surviving a mutation for a certain schema $H$ is $(1 - p_m)^{o(H)}$, which can
be approximated by $1 - o(H)\,p_m$ for $p_m \ll 1$.

Summarizing the described effects of mutation, crossover, and reproduction, 
we end up with Holland’s well known schema theorem [Hol75]: 

$$ m(H,t+1) \geq m(H,t)\,\frac{f_H(t)}{\bar{f}(t)} \left[ 1 - p_c\,\frac{\delta(H)}{l-1} - o(H)\,p_m \right] \tag{1.4} $$

The result essentially says that the number of short schemata with low order 
and above average quality grows exponentially in subsequent generations of a 
genetic algorithm. 
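
To get a feeling for the quantities involved, the lower bound of the schema
theorem can be evaluated numerically. The sketch below computes the order and
the defining length of a schema given as a string over {0,1,#} and then the
right-hand side of the theorem; the function and parameter names as well as
the numbers in the usage example are purely illustrative assumptions.

```python
def order(schema):
    """Number of fixed (non-wildcard) positions."""
    return sum(1 for s in schema if s != "#")


def defining_length(schema):
    """Distance between the first and the last fixed position."""
    fixed = [i for i, s in enumerate(schema) if s != "#"]
    return fixed[-1] - fixed[0] if fixed else 0


def schema_theorem_bound(m, f_schema, f_avg, schema, p_c, p_m):
    """Lower bound on the expected number of instances in the next generation."""
    l = len(schema)
    survival = 1.0 - p_c * defining_length(schema) / (l - 1) - order(schema) * p_m
    return m * (f_schema / f_avg) * survival


if __name__ == "__main__":
    h = "###0#0101"
    print(order(h), defining_length(h))
    # 10 instances, 20% above average fitness, p_c = 0.8, p_m = 0.01
    print(schema_theorem_bound(10, 1.2, 1.0, h, 0.8, 0.01))
```

With these illustrative numbers the bound drops below the current count of
10, which shows how a long defining length can outweigh an above average
fitness.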

Still, even if the schema theorem is a very important result in GA theory, it
is obtained under idealized conditions that do not hold for most practical GA
applications. Both the individual representation and the genetic operators are
often different from those used by Holland. The building block hypothesis has
been found reliable in many cases, but it also depends on the representation
and on the genetic operators. Therefore, it is easy to find or to construct
problems for which it does not hold. These so-called deceptive problems are
studied in order to find out the inherent limitations of GAs, and which
representations and operators can make them more tractable. A more detailed
description of the underlying theory can for instance be found in [Raw91] or
[Whi93].

The major drawback of the building block theory is that the underlying GA
(binary encoding, proportional selection, single-point crossover, strong
mutation) is applicable only to very few problems, as challenging real-world
problems require more sophisticated problem representations and corresponding
operators. Therefore, a more general theory has been an intensely studied
topic in GA research since its beginning. Some theoretically interesting
approaches like the forma theory of Radcliffe and Surry [RS94], who consider
a so-called forma as a more general schema for arbitrary representations,
impose requirements on the operators which cannot be fulfilled for practical
problems with their respective constraints.

By the end of the last millennium, Stephens and Waelbroeck ([SW97],
[SW99]) developed an exact GA schema theory. The main idea is to describe
the total transmission probability $\alpha$ of a schema $H$ so that
$\alpha(H,t)$ is the probability that at generation $t$ the individuals of
the GA’s population will match $H$ (for a GA working on fixed-length bit
strings). Assuming a crossover probability $p_{xo}$, $\alpha(H,t)$ is
calculated as¹:

$$ \alpha(H,t) = (1 - p_{xo})\,p(H,t) + \frac{p_{xo}}{N-1} \sum_{i=1}^{N-1} p(L(H,i),t)\,p(R(H,i),t) \tag{1.5} $$
with $L(H,i)$ and $R(H,i)$ being the left and right parts of schema $H$,
respectively, and $p(H,t)$ the probability of selecting an individual matching
$H$ to become a parent. The “left” part of a schema $H$ is thereby produced by
replacing all elements of $H$ at the positions from $i+1$ to $N$ with “don’t
care” symbols (with $N$ being the length of the bit strings and $i$ the given
index); the “right” part of a schema $H$ is produced by replacing all elements
of $H$ from position 1 to $i$ with “don’t care.” The summation is over all
positions from 1 to $N-1$, i.e., over all possible crossover points.


¹ We here give the slightly modified version as stated in [LP02]; it is
equivalent to the results in [SW97] and [SW99] assuming $p_m = 0$.

Stephens later generalized this GA schema theory to variable-length GAs; see 
for example [SPWR02]. 
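
The total transmission probability for fixed-length bit strings can be
evaluated directly for a small population. The sketch below is an
illustration under explicit assumptions: selection is taken to be fitness
proportional, crossover point i is taken to separate positions 1..i from
positions i+1..N, and all function names are hypothetical.

```python
def matches(schema, string):
    """True if `string` is an instance of `schema` ('#' is the wild card)."""
    return all(s in ("#", c) for s, c in zip(schema, string))


def left_part(schema, i):
    """Keep positions 1..i, replace the rest with wild cards."""
    return schema[:i] + "#" * (len(schema) - i)


def right_part(schema, i):
    """Replace positions 1..i with wild cards, keep the rest."""
    return "#" * i + schema[i:]


def selection_probability(schema, population, fitness):
    """p(H,t) under fitness proportional selection (assumption)."""
    total = sum(fitness(s) for s in population)
    return sum(fitness(s) for s in population if matches(schema, s)) / total


def transmission_probability(schema, population, fitness, p_xo):
    """alpha(H,t) for one-point crossover and no mutation."""
    n = len(schema)

    def p(h):
        return selection_probability(h, population, fitness)

    crossover_term = sum(
        p(left_part(schema, i)) * p(right_part(schema, i)) for i in range(1, n)
    )
    return (1.0 - p_xo) * p(schema) + p_xo / (n - 1) * crossover_term


if __name__ == "__main__":
    population = ["1100", "1010", "0011", "1111"]
    print(transmission_probability("1##1", population, lambda s: 1 + s.count("1"), 0.7))
```

Two sanity checks follow from the formula: without crossover the result
equals the selection probability of the schema, and the schema consisting
only of wild cards is transmitted with probability 1.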


Keeping in mind that the ultimate goal of any heuristic optimization tech- 
nique is to approximately and efficiently solve highly complex real-world prob- 
lems rather than stating a mathematically provable theory that holds only 
under very restricted conditions, our intention for an extended building block 
theory is a not so strict formulation that in return can be interpreted for ar- 
bitrary GA applications. At the same time, the enhanced variants of genetic 
algorithms and genetic programming proposed in this book aim to support the 
algorithms in their intention to operate in the sense of an extended building 
block interpretation discussed in the following chapters. 


1.7 Parallel Genetic Algorithms 


The basic idea behind many parallel and distributed programs is to divide 
a task into partitions and solve them simultaneously using multiple proces- 
sors. This divide-and-conquer approach can be used in different ways, and 
leads to different methods to parallelize GAs where some of them change the 
behavior of the GA whereas others do not. Some methods (as for instance 
fine-grained parallel GAs) can exploit massively parallel computer architec- 
tures, while others (coarse-grained parallel GAs, e.g.) are better qualified for 
multi-computers with fewer and more powerful processing elements. Detailed 
descriptions and classifications of distributed GAs are given in [CP01], [CP97] 
or [AT99] and [Alb05]; the scalability of parallel GAs is discussed in [CPG99]. 
A further and newer variant of parallel GAs which is based on offspring selec- 
tion (see Chapter 4) is the so-called SASEGASA algorithm which is discussed 
in Chapter 5. 


In a rough classification, parallel GA concepts established in GA textbooks 
(as for example [DLJD00]) can be classified into global parallelization, coarse- 
grained parallel GAs, and fine-grained parallel GAs, where the most popular 
model for practical applications is the coarse-grained model, also very well 
known as the island model. 
FIGURE 1.3: Global parallelization concepts: A panmictic population
structure (shown in left picture) and the corresponding master-slave model
(right picture).


1.7.1 Global Parallelization 


Similar to the sequential GA, in the context of global parallelization there is 
only one single panmictic² population and selection considers all individuals,
i.e., every individual has a chance to mate with any other. The behavior 
of the algorithm remains unchanged and the global GA has exactly the same 
qualitative properties as a sequential GA. The most common operation that is 
parallelized is the evaluation of the individuals as the calculation of the fitness 
of an individual is independent from the rest of the population. Because of 
this the only necessary communication during this phase is in the distribution 
and collection of the workload. 

One master node executes the GA (selection, crossover, and mutation), and
the evaluation of fitness is divided among several slave processors. Parts of
the population are assigned to each of the available processors, which
return the fitness values for the subset of individuals they have received.
Due to their centralized and hierarchical communication structure, global
parallel GAs are also known as single-population master-slave GAs.

Figure 1.3 shows the population structure of a master-slave parallel GA: 
This panmictic GA has all its individuals (indicated by the black spots) in the 
same population. The master stores the population, executes the GA opera- 
tions, and distributes individuals to the slaves; the slaves compute the fitness 
of the individuals. As a consequence, global parallelization can be efficient 
only if the bottleneck in terms of runtime consumption is the evaluation of 
the fitness function. 

Globally parallel GAs are quite easy to implement, and they can be a quite 
efficient method of parallelization if the evaluation requires considerable com- 
putational effort compared to the effort required for the operations carried out 
by the master node. However, they do not influence the qualitative properties 
of the corresponding sequential GA. 
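
The master-slave scheme can be sketched in a few lines. The example below
only parallelizes the fitness evaluation and leaves everything else to the
master; it uses a thread pool from the Python standard library as a stand-in
for the slave processors, and the toy fitness function (counting ones) as well
as the function names are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fitness(bits):
    """Toy fitness function: number of ones in the bit list."""
    return sum(bits)


def evaluate_population(population, fitness_fn, workers=4):
    """The master distributes individuals and collects their fitness values.

    The result order matches the order of the population, so the master can
    continue with selection exactly as in the sequential GA.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fitness_fn, population))


if __name__ == "__main__":
    population = [[1, 0, 1], [1, 1, 1], [0, 0, 0]]
    print(evaluate_population(population, fitness))
```

For a CPU-bound fitness function in Python, a process pool or a cluster
scheduler would be the more realistic choice; the structure of the master
loop stays the same.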


² In general, a population is called panmictic when all individuals are
possible mating partners.
1.7.2 Coarse-Grained Parallel GAs


FIGURE 1.4: Population structure of a coarse-grained parallel GA.


In the case of a coarse-grained parallel GA, the population is divided into 
multiple subpopulations (also called islands or demes) that evolve mostly 
isolated from each other and only occasionally exchange individuals during 
phases called migration. This process is controlled by several parameters 
which will be explained later in Section 1.7.4. In contrast to the global paral- 
lelization model, coarse-grained parallel GAs introduce fundamental changes 
in the structure of the GA and have a different behavior than a sequential GA. 
Coarse-grained parallel GAs are also known as distributed GAs because they 
are usually implemented on computers with distributed memories. Litera- 
ture also frequently uses the notation “island parallel GAs” because there is a 
model in population genetics called the island model that considers relatively 
isolated demes. 


Figure 1.4 schematically shows the design of a coarse-grained parallel GA: 
Each circle represents a simple GA, and there is (infrequent) communication 
between the populations. The qualitative performance of a coarse-grained 
parallel GA is influenced by the number and size of its demes and also by 
the information exchange between them (migration). The main idea of this 
type of parallel GAs is that relatively isolated demes will converge to differ- 
ent regions of the solution-space, and that migration and recombination will 
combine the relevant solution parts [SWM91]. However, at present there is 
only one model in the theory of coarse-grained parallel GAs that considers 
the concept of selection pressure for recombining the favorable attributes of 
solutions evolved in the different demes, namely the SASEGASA algorithm 
(which will be described later in Chapter 5). Coarse-grained parallel GAs 
are the most frequently used parallel GA concept, as they are quite easy to 
implement and are a natural extension to the general concept of sequential 
GAs making use of commonly available cluster computing facilities. 
1.7.3 Fine-Grained Parallel GAs


FIGURE 1.5: Population structure of a fine-grained parallel GA; the special 
case of a cellular model is shown here. 


Fine-grained models consider a large number of very small demes; Figure 1.5 
sketches a fine-grained parallel GA. This class of parallel GAs has one spatially 
distributed population; it is suited for massively parallel computers, but it 
can also be implemented on other supercomputing architectures. A typical 
example is the diffusion model [Müh89] which represents an intrinsic parallel
GA-model. 


The basic idea behind this model is that the individuals are spread through- 
out the global population like molecules in a diffusion process. Diffusion 
models are also called cellular models. In the diffusion model a processor 
is assigned to each individual and recombination is restricted to the local 
neighborhood of each individual. 
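
The restriction of recombination to a local neighborhood can be illustrated
with a small sketch. It assumes a two-dimensional toroidal grid with a
four-cell (von Neumann) neighborhood and picks the fittest neighbor as mate;
both the grid layout and the mate selection rule are assumptions for this
example, not details taken from [Müh89].

```python
def neighbours(row, col, rows, cols):
    """Grid positions directly above, below, left, and right (with wrap-around)."""
    return [
        ((row - 1) % rows, col),
        ((row + 1) % rows, col),
        (row, (col - 1) % cols),
        (row, (col + 1) % cols),
    ]


def select_local_mate(grid_fitness, row, col):
    """Position of the fittest individual in the local neighborhood."""
    rows, cols = len(grid_fitness), len(grid_fitness[0])
    return max(
        neighbours(row, col, rows, cols),
        key=lambda rc: grid_fitness[rc[0]][rc[1]],
    )


if __name__ == "__main__":
    grid = [[1, 5, 2],
            [3, 0, 0],
            [9, 0, 0]]
    print(neighbours(0, 0, 3, 3))
    print(select_local_mate(grid, 0, 0))
```

Because neighborhoods overlap, good genetic material can still spread over
the whole grid, but only gradually from cell to cell.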


A recent research topic in the area of parallel evolutionary computation is 
the combination of certain aspects of the different population models resulting 
in so-called hybrid parallel GAs. Most of the hybrid parallel GAs are coarse- 
grained at the upper level and fine-grained at the lower levels. Another way 
to hybridize parallel GAs is to use coarse-grained GAs at the high as well as 
at the low levels in order to force stronger mixing at the low levels using high 
migration rates and a low migration rate at the high level [CP01]. Using this 
strategy, computer cluster environments at different locations can collectively 
work on a common problem with only little communication overhead (due to 
the low migration rates at the high level). 
1.7.4 Migration


Especially for coarse-grained parallel GAs the concept of migration is con- 
sidered to be the main success criterion in terms of achievable solution quality. 
The most important parameters for migration are: 


• The communication topology, which defines the interconnections between
the subpopulations (demes)

• The migration scheme, which controls which individuals (best, random)
migrate from one deme to another and which individuals should be
replaced (worst, random, doubles)

• The migration rate, which determines how many individuals migrate

• The migration interval or migration gap, which determines the frequency
of migrations


The most essential question concerning migration is when and to what extent
migration should take place. Much theoretical work on this question has
already been done; for a survey of these efforts see [CP97] or [Alb05]. In
parallel GAs migration usually occurs synchronously, meaning that it occurs
at predetermined constant intervals. However, synchronous migration is known
to be slow and inefficient in some cases [AT99]. Asynchronous migration
schemes perform communication between demes only after specific events. The
migration rate, which determines how many individuals undergo migration at
every exchange, can be expressed as a percentage of the population size or as
an absolute value. The majority of articles in this field suggest migration
rates between 5% and 20% of the population size. However, the choice of this
parameter is considered to be very problem dependent [AT99]. A recent
overview of various migration techniques is given in [CP01].
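
The four migration parameters can be tied together in a short sketch of
synchronous migration. The example assumes a unidirectional ring topology, a
"best replace worst" migration scheme in which migrants are copied rather than
moved, demes of equal size, and a fitness function that is to be maximized;
the function names and default values are illustrative choices only.

```python
def migrate_ring(demes, fitness, rate=0.1):
    """Copy the best individuals of each deme to its successor on the ring,
    where they replace the worst individuals."""
    counts = [max(1, int(rate * len(deme))) for deme in demes]
    emigrants = [
        sorted(deme, key=fitness, reverse=True)[:k]
        for deme, k in zip(demes, counts)
    ]
    new_demes = []
    for idx, deme in enumerate(demes):
        incoming = emigrants[(idx - 1) % len(demes)]
        keep = max(0, len(deme) - len(incoming))
        survivors = sorted(deme, key=fitness, reverse=True)[:keep]
        new_demes.append(survivors + incoming)
    return new_demes


def run_islands(demes, evolve_one_generation, fitness, generations,
                interval=10, rate=0.1):
    """Evolve all demes independently and migrate every `interval` generations."""
    for generation in range(1, generations + 1):
        demes = [evolve_one_generation(deme) for deme in demes]
        if generation % interval == 0:
            demes = migrate_ring(demes, fitness, rate)
    return demes


if __name__ == "__main__":
    islands = [[1, 2, 3, 4], [10, 20, 30, 40], [5, 6, 7, 8]]
    print(migrate_ring(islands, fitness=lambda x: x, rate=0.25))
```

Here evolve_one_generation stands for whatever sequential GA step is used
inside each deme; an asynchronous variant would replace the fixed interval
test by an event such as detected stagnation.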


Recent theory of self-adaptive selection pressure steering (see Chapters 4
and 5) plays a major role in challenging the conventions of recent parallel
GA theory. Within these models it becomes possible to detect local premature
convergence, i.e., premature convergence in a certain deme. Because this
detection works independently in all demes, it should offer a high potential
in terms of efficiency, especially for parallel implementations. Furthermore,
the fact that selection pressure is adjusted self-adaptively with respect to
the potential of the genetic information stored in the individual demes makes
the concept of a parallel GA much less dependent on the migration parameters
(see [Aff05] and Chapter 5).
1.8 The Interplay of Genetic Operators


In order to allow an efficient performance of a genetic algorithm, a beneficial 
interplay of exploration and exploitation should be possible. Critical factors 
for this interplay are the genetic operators selection, crossover, and mutation. 


The job of crossover is to advantageously combine alleles of selected (above 
average) chromosomes which may stem from different regions of the search 
space. Therefore, crossover is considered to rather support the aspect of 
breadth search. Mutation slightly modifies certain chromosomes at times and 
thus brings new alleles into the gene pool of a population in order to avoid 
stagnation. As mutation modifies the genetic make-up of certain chromosomes 
only slightly it is primarily considered as a depth search operator. However, 
via mutation newly introduced genetic information does also heavily support 
the aspect of breadth search if crossover is able to “transport” this new genetic 
information to other chromosomes in other search space regions. As we will 
show later in this book, this aspect of mutation is of prime importance for an 
efficient functioning of a GA. 


The aspect of migration in coarse-grained parallel GAs should also be
mentioned in our considerations about the interplay of operators. In this kind
of parallel GA, migration functions somewhat like a meta-model of mutation,
introducing new genetic information into certain demes at the chromosome
level, whereas mutation introduces new genetic information at the allele
level. Concerning migration, a well-adjusted interplay between breadth and
depth search is intended to work as follows: Breadth search is supported in
the intra-migration phases by allowing the individual demes to drift to
different regions of the search space until a certain stage of stagnation is
reached; the demes have then expanded over the search space. Then migration
comes into play by introducing new chromosomes stemming from other search
space regions in order to avoid stagnation in the individual demes; this
causes the demes to contract again slightly, which from a global point of view
tends to support the aspect of depth search in the migration phases. The
reason for this is that migration increases the genetic diversity within the
specific demes on the one hand, but on the other hand decreases the diversity
over all islands. This global loss of genetic diversity can be interpreted as
an exploitation of the search space.


This overall strategy is especially beneficial in the case of highly
multimodal search spaces, as is the case for complex combinatorial
optimization problems.
1.9 Bibliographic Remarks


There are numerous books, journals, and articles available that survey the 
field of genetic algorithms. In this section we summarize some of the most 
important ones. Representatively, the following books are widely considered 
very important sources of information about GAs (in chronological order): 


• J. H. Holland: Adaptation in Natural and Artificial Systems [Hol75]

• D. E. Goldberg: Genetic Algorithms in Search, Optimization and Machine
Learning [Gol89]

• Z. Michalewicz: Genetic Algorithms + Data Structures = Evolution
Programs [Mic92]

• D. Dumitrescu et al.: Evolutionary Computation [DLJD00]


The following journals are dedicated to either theory and applications of 
genetic algorithms or evolutionary computation in general: 


• IEEE Transactions on Evolutionary Computation (IEEE)
• Evolutionary Computation (MIT Press)
• Journal of Heuristics (Springer)


Moreover, several conference and workshop proceedings include papers re- 
lated to genetic and evolutionary algorithms and heuristic optimization. Some 
examples are the following ones: 


• Genetic and Evolutionary Computation Conference (GECCO), a recombination
of the International Conference on Genetic Algorithms and the
Genetic Programming Conference

• Congress on Evolutionary Computation (CEC)
• Parallel Problem Solving from Nature (PPSN)


Of course there is a lot of GA-related information available on the inter- 
net including theoretical background and practical applications, course slides, 
and source code. Publications of the Heuristic and Evolutionary Algorithms 
Laboratory (HEAL) (including several articles on GAs and GP) are available 
at http://www.heuristiclab.com/publications/. 


Chapter 2 


Evolving Programs: Genetic 
Programming 


In the previous chapter we have summarized and discussed genetic algorithms; 
it has been illustrated how this kind of algorithms is able to produce high 
quality results for a variety of problem classes. 

Still, a GA is by itself not able to handle one of the most challenging tasks 
in computer science, namely getting a computer to solve problems without 
programming it explicitly. As Arthur Samuel stated in 1959 [Sam59], this 
central task can be formulated in the following way: 


How can computers be made to do what needs to be done, 
without being told exactly how to do it? 


In this chapter we give a compact description and discussion of an extension 
of the genetic algorithm called genetic programming (GP). Similar to GAs, 
genetic programming works on populations of solution candidates for a given 
problem and is based on Darwinian principles of survival of the fittest (selec- 
tion), recombination (crossover), and mutation; it is a domain-independent, 
biologically inspired method that is able to create computer programs from a 
high-level problem statement.¹

Research activities in the field of genetic programming started in the 1980s; 
still, it took some time until GP was widely received by the computer sci- 
ence community. Since the beginning of the 1990s GP has been established 
as a human-competitive problem solving method. The main factors for its 
widely accepted success in the academic world as well as in industries can be 
summarized in the following way [Koz92b]: 


• Virtually all problems in artificial intelligence, machine learning,
adaptive systems, and automated learning can be recast as a search for
computer programs, and


• genetic programming provides a way to successfully conduct the search
in the space of computer programs.


1 Please note that here we generally regard computer programs as entities that receive inputs,
perform computations, and produce output.


In the following we


• give an overview of the main ideas and foundations of genetic
programming in Sections 2.1 and 2.2,

• summarize basic steps of the GP-based problem solving process
(Section 2.3),

• report on typical application scenarios (Section 2.4),

• explain theoretical foundations (GP schema theories, Section 2.5),

• discuss current GP challenges and research areas in Section 2.6,

• summarize this chapter on GP in Section 2.7, and finally

• refer to a range of outstanding literature on the theory and practice
of GP in Section 2.8.


2.1 Introduction: Main Ideas and Historical 
Background 


As has already been mentioned, one of the central tasks in artificial
intelligence is to make computers do what needs to be done without telling
them exactly how to do it. This demand is not unnatural, since it asks
computers to mimic the human reasoning process: humans are able to learn
what needs to be done, and how to do it. In short, interactions of networks
of neurons are nowadays believed to be the basis of human brain information
processing; several of the earliest approaches in artificial intelligence aimed at 
imitating this structure using connectionist models and artificial neural net- 
works (ANNs, [MP43]). Suitable network training algorithms enable ANNs 
to learn and generalize from given training examples; ANNs are in fact a 
very successful distributed computation paradigm and are frequently used in 
real-world applications where exact algorithmic approaches are too difficult to 
implement or are not even known at all. Pattern recognition, classification,
and data-based modeling (regression) are some examples of AI areas in which
ANNs have been applied in numerous ways.

Unlike this network-based approach, genetic algorithms were developed
using the main principles of natural evolution.
As has been explained in Chapter 1, GAs are population-based optimization 
algorithms that imitate natural evolution: Starting with a primordial ooze of 
thousands of randomly created solution candidates appropriate to the respec- 
tive problem, populations of solutions are progressively evolved over many 
generations using the Darwinian principles. 

Similar to the GA, GP is an evolutionary algorithm inspired by biological
evolution to find computer programs that perform a user-defined computa- 
tional task. It is therefore a machine learning technique used to optimize a 
population of computer programs according to a fitness landscape determined 
by a program’s ability to perform the given task; it is a domain-independent, 
biologically inspired method that is able to create computer programs from a 
high-level problem statement (with computer programs being here defined as 
entities that receive inputs, perform computations, and produce output). 

The first research activities in the context of GP were reported in
the early 1980s. For example, Smith reported on a learning system based on
GAs [Smi80], and in [For81] Forsyth presented a computer package producing 
decision-rules (i.e., small computer programs) in forensic science for the UK 
police by induction from a database (where these rules are Boolean expressions 
represented by tree structures). In 1985, Cramer presented a representation 
for the adaptive generation of simple sequential programs [Cra85]; it is widely 
accepted that this article on genetic programming is the first paper to de- 
scribe the tree-like representation and operators for manipulating programs 
by genetic algorithms. 

Even though there was noticeable research activity in the field of GP by
the middle of the 1980s, it still took some time until GP was widely
received by the computer science community. GP is computationally very
intensive, and so it was mainly used to solve relatively simple problems
until the 1990s. Thanks to the enormous growth in CPU power since the
1980s, the field of applications for GP has been extended immensely,
yielding human-competitive results in areas such as data-based modeling,
electronic design, game playing, sorting, searching, and many more;
examples (and respective references) are given in the following sections.

One of the most important GP publications was “Genetic Programming: 
On the Programming of Computers by Means of Natural Selection” [Koz92b] 
by John R. Koza, professor of computer science and medical informatics at
Stanford University who has since been one of the main proponents of the 
GP idea. Based on extensive theoretical background as well as test results 
in many different problem domains he demonstrated GP’s ability to serve as 
an automated invention machine producing novel and outstanding results for 
various kinds of problems. By now there have been three more books on GP 
by Koza (and his team), but also several other very important publications 
(for example by Banzhaf, Langdon, Poli and many others); a short summary 
is given in Section 2.8. 

Along with these ad hoc engineering approaches there was an increasing
interest in how and why GP works. Even though GP was applied successfully
for solving problems in various areas, the development of a GP theory was
considered rather difficult even through the 1990s. Since the early 2000s it
has finally been possible to establish a theory of GP, which has developed
rapidly since then. A book that has to be mentioned in this context is
“Foundations of Genetic Programming” [LP02] by Langdon and Poli, since it
presents exact GP schema analysis.

Having summarized the historical background of GP, we now describe how it
actually works and how typical applications are designed; this is the
subject of the following sections.


2.2 Chromosome Representation 


As in any GA-based problem solving process, the representation of problem
instances and solution candidates is a key issue in genetic programming as
well. On the one hand, the representation scheme should enable the
algorithm to find suitable solutions for the given problem class, but on the
other hand the algorithm should be able to directly manipulate the coded
solution representation. The use of fixed-length strings (of bits,
characters, or integers, for example) enables the conventional GA to solve a
huge number of problems and also allows the construction of a solid
theoretical foundation, namely the schema theorem. Still, in the context of
GP the most natural representation for a solution is a hierarchical computer
program of variable size [Koz92b].


2.2.1 Hierarchical Labeled Structure Trees 
2.2.1.1 Basics 


So, how can hierarchical computer programs be represented? The
representation that is most common in the literature and is used by Koza
([Koz92b], [Koz94], [KIAK99], [KKS+03b]), Langdon and Poli ([LP02]), and
many other authors is the point-labeled structure tree. Originally, these
structure trees were seen as graphical representations of so-called
S-expressions of the programming language LISP ([McC60], [Que03], [WH87]),
which have for example been used by Koza in [Koz92b] and [Koz94].2 Here we
do not strictly stick to LISP syntax for the examples given, but the main
paradigms of S-expressions are used.

The following key facts are relevant in the context of structure tree based 
genetic programming: 


• All tree nodes are either functions or terminals.


• Terminals are evaluated directly, i.e., their return values can be
calculated and returned immediately.


• All functions have child nodes which are evaluated before using the
children’s calculated return values as inputs for the parents’ evaluation.


• Probably the most convenient string representation is the prefix
notation, also called Polish or Łukasiewicz3 notation: Function nodes are
given before the child nodes’ representations (optionally using
parentheses). Evaluation is executed recursively, in a depth-first way,
starting from the left; operators are thus placed to the left of their
operands.

In case of fixed arities of the functions (i.e., if the number of each
function’s inputs is fixed and known), no parentheses or brackets are needed.


2 In fact, of course, any higher programming language is suitable for implementing a
GP framework and for representing hierarchical computer programs. Koza, for example,
switched to the C programming language as described in [KIAK99], and the HeuristicLab
framework and the GP implementation, which is realized as plug-ins for it, are programmed
in C# using the .NET framework; this is explained in further detail later.


In a more formal way this program representation structure schema can be
summarized as follows [ES03]:


• Symbolic expressions can be defined using

– a terminal set $T$, and
– a function set $F$.


• The following general recursive definition is applied:

– Every $t \in T$ is a correct expression,

– $f(e_1, \ldots, e_n)$ is a correct expression if $f \in F$,
$\mathrm{arity}(f) = n$, and $e_1, \ldots, e_n$ are correct expressions, and

– there are no other forms of correct expressions.


• In general, expressions in GP are not typed (closure property: any
$f \in F$ can take any $g \in F$ as argument). Still, as we see in the
discussion of genetic operators in Section 2.2.1.3, this might not be true
in certain cases depending on the function and terminal sets chosen.
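
The recursive definition above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not taken from the text: expressions are encoded as nested tuples whose first element is the function symbol, and the names TERMINALS, ARITY, and is_correct_expression are hypothetical.

```python
# Hypothetical terminal set T and function set F (with arities), chosen
# only to illustrate the recursive definition of correct expressions.
TERMINALS = {"X", "Y", 2, 5}
ARITY = {"ADD": 2, "MULT": 2, "DIV": 2}


def is_correct_expression(expr):
    """Return True if expr is a correct expression over TERMINALS and ARITY."""
    # Rule 1: every t in T is a correct expression.
    if not isinstance(expr, tuple):
        return expr in TERMINALS
    # Rule 3: anything that matches neither rule 1 nor rule 2 is rejected.
    if not expr:
        return False
    # Rule 2: f(e_1, ..., e_n) with f in F, arity(f) = n, all e_i correct.
    f, *args = expr
    return (
        f in ARITY
        and ARITY[f] == len(args)
        and all(is_correct_expression(e) for e in args)
    )


# DIV(ADD(X,5),MULT(2,Y)) written as a nested tuple.
program_b = ("DIV", ("ADD", "X", 5), ("MULT", 2, "Y"))
```

Note that this check covers only arities, in line with the untyped view of expressions; it would not detect type conflicts such as comparing a Boolean value with a number.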


In the following we give simple exemplary programs, both in conventional
notation and in prefix notation (not exactly following LISP syntax):


• (a) IF (Y>X OR Y<4) THEN i:=(i+1), ELSE i:=0.
Prefix notation: IF(OR(>(Y,X),<(Y,4)),:=(i,+(i,1)),:=(i,0)).


• (b) (X+5)/(2*Y). Prefix notation: DIV(ADD(X,5),MULT(2,Y)).


Graphical representations of the programs (given as rooted, point-labeled 
structure trees) are given in Figure 2.1. 


3 Jan Łukasiewicz (1878-1956), a Polish mathematician, invented the prefix notation which
is also the basis of the recursive stack (“last in, first out”; [Ham58], [Ham62]). In reference
to his nationality the notation is also referred to as “Polish” notation.


FIGURE 2.1: Exemplary programs given as rooted, labeled structure trees.


2.2.1.2 Evaluation 


As mentioned previously, the execution (evaluation) of GP chromosomes
representing hierarchical computer programs as structure trees is done
recursively, in a depth-first way, starting from the left. In order to
demonstrate this we simulate the evaluation of the example programs given in
Section 2.2.1.1; graphical representations are given in Figures 2.2 and 2.3.


• (a) Internal states before execution: X = 7, Y = 3, i = 2.
Execution:
IF(OR(>(Y,X),<(Y,4)),:=(i,+(i,1)),:=(i,0))
=> IF(OR(>(3,7),<(Y,4)),:=(i,+(i,1)),:=(i,0))
=> IF(OR(FALSE,<(Y,4)),:=(i,+(i,1)),:=(i,0))
=> IF(OR(FALSE,<(3,4)),:=(i,+(i,1)),:=(i,0))
=> IF(OR(FALSE,TRUE),:=(i,+(i,1)),:=(i,0))
=> IF(TRUE,:=(i,+(i,1)),:=(i,0))
=> :=(i,+(i,1))
=> :=(i,+(2,1))
=> :=(i,3).
Internal states after execution: X = 7, Y = 3, i = 3.


• (b) Internal states before execution: X = 7, Y = 3.
Execution:
DIV(ADD(X,5),MULT(2,Y))
=> DIV(ADD(7,5),MULT(2,Y))
=> DIV(12,MULT(2,Y))
=> DIV(12,MULT(2,3))
=> DIV(12,6)
=> 2.
Return value: 2; internal states after execution: X = 7, Y = 3.
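
The two traces can be reproduced with a small recursive interpreter. The following Python sketch is not the implementation discussed in this book; the nested-tuple encoding, the FUNCTIONS table, and the treatment of IF and := as special forms are assumptions made for illustration.

```python
import operator

# Hypothetical function table for the symbols used in programs (a) and (b).
FUNCTIONS = {
    "OR": lambda a, b: a or b,
    ">": operator.gt,
    "<": operator.lt,
    "+": operator.add,
    "ADD": operator.add,
    "MULT": operator.mul,
    "DIV": operator.truediv,
}


def evaluate(node, state):
    """Evaluate a program tree recursively, depth-first, from the left."""
    if not isinstance(node, tuple):
        # Terminal: a variable is looked up, a constant is returned as is.
        return state.get(node, node) if isinstance(node, str) else node
    op, *children = node
    if op == "IF":
        cond, then_branch, else_branch = children
        # Only the branch selected by the condition is evaluated.
        chosen = then_branch if evaluate(cond, state) else else_branch
        return evaluate(chosen, state)
    if op == ":=":
        target, value_expr = children
        state[target] = evaluate(value_expr, state)
        return state[target]
    # Ordinary function: evaluate all children first, then apply it.
    args = [evaluate(child, state) for child in children]
    return FUNCTIONS[op](*args)


program_a = ("IF", ("OR", (">", "Y", "X"), ("<", "Y", 4)),
             (":=", "i", ("+", "i", 1)), (":=", "i", 0))
program_b = ("DIV", ("ADD", "X", 5), ("MULT", 2, "Y"))

state_a = {"X": 7, "Y": 3, "i": 2}
evaluate(program_a, state_a)          # afterwards state_a["i"] is 3
result_b = evaluate(program_b, {"X": 7, "Y": 3})
```

Because DIV is mapped to true division here, program (b) returns the floating point value 2.0 rather than the integer 2.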
FIGURE 2.2: Exemplary evaluation of program (a).


FIGURE 2.3: Exemplary evaluation of program (b).


2.2.1.3 Genetic Operations: Crossover and Mutation


As genetic programming is an extension to the genetic algorithm, GP also
uses two main operators for producing new solution candidates in the search
space, namely crossover and mutation.

As we already know from Chapter 1, crossover, the most important repro- 
duction operator, takes two parent individuals and produces new offspring by 
swapping parts of the parents. Here we immediately see one of the major 
advantages of hierarchical tree representations of computer programs: Single- 
point crossover can be simply performed by replacing a subtree of (a copy 
of) one of the parents by a subtree of the other parent; these subtrees are 
chosen at random. There are several different strategies for selecting these 
subtrees as it might be reasonable to choose either rather small, rather big, 
or completely randomly chosen parts. 
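
A minimal Python sketch of such a subtree crossover is given below. It assumes programs encoded as nested tuples and selects both subtrees uniformly at random among all nodes, which is only one of the selection strategies mentioned above; all function names are hypothetical.

```python
import random


def subtree_paths(tree, path=()):
    """Yield the path (tuple of child indices) of every node in the tree."""
    yield path
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtree_paths(child, path + (i,))


def get_subtree(tree, path):
    """Return the subtree found by following the child indices in path."""
    for i in path:
        tree = tree[i]
    return tree


def replace_subtree(tree, path, new_subtree):
    """Return a copy of tree in which the node at path is replaced."""
    if not path:
        return new_subtree
    i = path[0]
    replaced = replace_subtree(tree[i], path[1:], new_subtree)
    return tree[:i] + (replaced,) + tree[i + 1:]


def crossover(parent1, parent2, rng=random):
    """Replace a random subtree of (a copy of) parent1 by one of parent2."""
    target = rng.choice(list(subtree_paths(parent1)))
    donor_path = rng.choice(list(subtree_paths(parent2)))
    return replace_subtree(parent1, target, get_subtree(parent2, donor_path))


parent1 = ("DIV", ("ADD", "X", 5), ("MULT", 2, "Y"))
parent2 = ("OR", (">", "Y", "X"), ("<", "Y", 4))
child = crossover(parent1, parent2, random.Random(0))
```

As tuples are immutable, the parents remain unchanged. The sketch performs no type checking, so it can produce exactly the kind of syntactically dubious offspring discussed later in this section.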

Mutation can be seen as an arbitrary modification introduced to prevent 
premature convergence by randomly sampling new points in the search space. 
In the case of genetic programming, mutation is applied by modifying a ran- 
domly chosen node of the respective structure tree: 


• A subtree could be deleted or replaced by a randomly re-initialized
subtree.


• A function node could for example change its function type or turn into
a terminal node.


Numerous other mutation variants are possible, many of them depending 
on the problem and chromosome representation chosen. In Chapter 11, for 
example, we describe mutation variants applicable for GP-based structure 
identification (related to symbolic regression, see Section 2.4.3). 
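
Two of the mutation variants listed above can be sketched as follows. The encoding (nested lists), the function and terminal sets, and the probabilities 0.3 and 0.5 are assumptions made for this illustration and are not taken from the text.

```python
import copy
import random

FUNCTIONS = {"ADD": 2, "MULT": 2, "DIV": 2}   # hypothetical function set (arities)
TERMINALS = ["X", "Y", 1, 2, 5]               # hypothetical terminal set


def random_tree(max_depth, rng):
    """Create a random subtree with at most max_depth levels of functions."""
    if max_depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    name = rng.choice(sorted(FUNCTIONS))
    return [name] + [random_tree(max_depth - 1, rng)
                     for _ in range(FUNCTIONS[name])]


def nodes(tree, parent=None, index=None):
    """Yield (node, parent, index_in_parent) for every node of the tree."""
    yield tree, parent, index
    if isinstance(tree, list):
        for i in range(1, len(tree)):
            yield from nodes(tree[i], tree, i)


def mutate(tree, rng, max_depth=2):
    """Return a mutated copy of tree; the original is left untouched."""
    mutant = copy.deepcopy(tree)
    node, parent, index = rng.choice(list(nodes(mutant)))
    if isinstance(node, list) and rng.random() < 0.5:
        # Variant 1: change the function type, keeping the arity intact.
        same_arity = [f for f in sorted(FUNCTIONS)
                      if FUNCTIONS[f] == len(node) - 1]
        node[0] = rng.choice(same_arity)
        return mutant
    # Variant 2: replace the chosen subtree by a randomly initialized one
    # (which may be a single terminal).
    replacement = random_tree(max_depth, rng)
    if parent is None:
        return replacement
    parent[index] = replacement
    return mutant
```

Restricting the new function type to functions of the same arity keeps every mutant syntactically valid; deleting a child without replacement, as in one of the examples discussed below, is deliberately not covered here.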

Figure 2.4 illustrates examples of sexual reproduction using the exemplary
programs (1) and (2) as parents, labeled as parent1 and parent2, respectively.
It thereby becomes obvious that in the context of GP there is a chance
of creating invalid chromosomes: The second offspring (child2) seems to be
incorrect since it includes the comparison of a Boolean value (Y>X OR Y<4)
and a number (2*Y). Thus, also in GP there are certain constraints that affect
the crossover of solution candidates; these constraints have to be considered
when it comes to designing and implementing a GP framework.

Of course, whether the evaluation of this syntactically dubious program can
be executed again depends on the chosen implementation. In the case of a
real-valued representation of Boolean values (for example, TRUE represented
by 1.0 and FALSE represented by 0.0), this structure tree represents a valid
program that can be calculated without any further problems.

Figure 2.5 illustrates exemplary results of applying mutation to program
(1). In the first case, a Boolean function node (<) is turned into another
type of Boolean function node (>) yielding mutant1; mutant2 is produced by
omitting a subtree, namely the second child of the OR function node. While
these first two mutants are syntactically correct, mutant3 is an example of
an invalid mutation: The first child of the conditional (IF) node has
been deleted, leaving the root node with only two children; the evaluation
of this program is not possible.

Again, a real-valued representation of Boolean values can help here. In this
case the value calculated by the first child of such a conditional node would
have to be interpreted as a Boolean value triggering the execution of the
second child subtree, the then-branch. As there is no third child node there
is also no else-branch, so there is probably no action if the first
(condition) node is evaluated (or at least interpreted) as false.

These two examples of syntactically incorrect programs demonstrate what
was hinted at in Section 2.2.1.1: Even though expressions are in general not
typed in GP, there are cases in which this is not true; this fact has to
be considered during the design and implementation of a GP-based problem
solving system.


2.2.1.4 Advantages 


As we are going to see later, the hierarchical structure tree is not the only
way in which programs can be modeled and used in the GP process. Still, the
combination of the following reasons strongly favors the choice of this
program representation schema4:


• Even though structure trees show an (at least for many people) rather
unusual appearance and syntax, most programming language compilers
internally convert given programs into parse trees representing the
underlying programs (i.e., their compositions of functions and terminals).
In most programming languages, these parse trees are not (conveniently)
accessible to the programmer; here we present the programs directly as
parse trees as we need to genetically manipulate parts of the programs
(subtrees).


• As evaluation is executed recursively starting from the root node, a
newly generated or manipulated program can be (re-)evaluated immediately
without any intermediate transformation step.


• Structure trees allow the representation of programs whose size and
shape change dynamically.


4 In fact, these reasons partially correlate to Koza’s reasons for choosing LISP for his GP
implementation reported on in [Koz92b] and [Koz94], for example.


FIGURE 2.4: Exemplary crossover of programs (1) and (2) labeled as parent1
and parent2, respectively. Child1 and child2 are possible new offspring
programs formed out of the genetic material of their parents.


FIGURE 2.5: Exemplary mutation of a program: The programs mutant1,
mutant2, and mutant3 are possible mutants of parent.


2.2.2 Automatically Defined Functions and Modular 
Genetic Programming 


Numerous variations and extensions to Koza’s structure tree based genetic
programming have been proposed since its publication at the beginning of the
1990s. Probably the best known and most frequently used one is the concept
of automatically defined functions (ADFs) proposed in “Genetic Programming
II: Automatic Discovery of Reusable Programs” [Koz94].

The main idea of ADFs is that program code (which has been evolved during
the GP process) is organized into useful groups (subroutines); this enables
the parameterized reuse and hierarchical invocation of evolved code as
functions that have not been taken from the original function set F but are
rather defined automatically. The (re-)use of subroutines (subprograms,
procedures) is enabled in this way. In the meantime the idea of ADFs has been
extended; automatically defined iterations, loops, macros, recursions, and
stores have since been proposed and their use demonstrated for example in
[Koz94], [KIAK99], and [KKS+03b].

With ADFs a GP chromosome program is split into a main program tree 
(which is called and executed from outside) and arbitrarily many separate 
trees representing ADFs. These separate functions can take arguments as 
well as be called by the main program or another ADF. 
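
The split of a chromosome into a main program tree and separate ADF trees can be sketched as follows. The dictionary layout, the names ADF0, ARG0, and ARG1, and the concrete trees are hypothetical and hand-written here; in a GP run they would be evolved.

```python
import operator

PRIMITIVES = {"ADD": operator.add, "MULT": operator.mul}

# A hypothetical chromosome: one main tree plus separately stored ADF trees.
chromosome = {
    "main": ("ADD", ("ADF0", "X", "Y"), ("ADF0", "Y", 2)),
    "adfs": {
        # ADF0(ARG0, ARG1) = ARG0 * ARG0 + ARG1
        "ADF0": (("ARG0", "ARG1"),
                 ("ADD", ("MULT", "ARG0", "ARG0"), "ARG1")),
    },
}


def evaluate(node, env, adfs):
    """Evaluate a tree; calls to ADFs are resolved via the adfs table."""
    if not isinstance(node, tuple):
        return env.get(node, node) if isinstance(node, str) else node
    name, *children = node
    args = [evaluate(child, env, adfs) for child in children]
    if name in adfs:
        params, body = adfs[name]
        # The ADF body sees only its own arguments, like a subroutine call.
        return evaluate(body, dict(zip(params, args)), adfs)
    return PRIMITIVES[name](*args)


result = evaluate(chromosome["main"], {"X": 3, "Y": 4}, chromosome["adfs"])
```

With X = 3 and Y = 4 the main program computes ADF0(3, 4) + ADF0(4, 2) = 13 + 18 = 31; the same evolved subroutine is thus reused with different arguments.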

Other approaches to modular genetic programming that have gained
popularity and are well known in the GP community are the genetic
library (presented by Angeline in [Ang93] and [Ang94], for example) and the
adaptive representation through learning (ARL) algorithm (proposed by Rosca,
see for example [Ros95a] or [RB96]). In both approaches, some parts of the
evolved code are automatically extracted from programs (usually from those
that show rather good fitness values). These extracted code fragments are
then fixed and kept in the GP library; thus they are available for the
evolving programs in the GP population.

Other advanced GP concepts that extend the tree concept are discussed 
in [Gru94], [KBAK99], [Jac99], and [WC99]; basic features of these modular 
GP approaches can be combined with multi-tree (multi-agent) systems which 
shall be described a bit later. 


2.2.3 Other Representations 


We are not going to say much about GP systems that are not based on trees
in the context of this book; still, the reader may suspect that there are
computer program representations other than the tree-based approach. In
fact, there are two other forms of GP that shall be mentioned here whose
program encoding differs significantly from the approach described
before: linear and graphical genetic programming.


2.2.3.1 Linear Genetic Programming 


The main difference between linear GP and tree-based GP is that in linear 
GP individuals of the GP algorithm (the programs) are not represented by 
structure trees but by linear chromosomes. These linear solutions represent 
lists of computer instructions which are executed linearly. 

Linear GP chromosomes are more similar to those of conventional GAs; 
however, their size is usually not fixed so that a GP population is likely to 
contain chromosomes of different sizes which is usually not the case with 
conventional GA approaches. On the one hand this of course brings along 
the loss of the advantages mentioned in Section 2.2.1.4, but on the other 
hand this schema easily enables the representation of stack-based programs, 
register-based programs, and machine code. 


• In general, a stack is a data structure based on the “last in, first out”
principle. If a program instruction is to be evaluated, it takes (pops) its 
arguments from the stack, performs the calculation, and writes back the 
result by adding (pushing) it back to the top of the stack. A chromosome 
in stack-based GP represents exactly such a stack-based program by 
storing the program instructions in a list and using a stack for executing 
the program. A typical example can be seen in Perkins’ article “Stack- 
Based Genetic Programming” [Per94]; a recent implementation has for 
example been presented in [HRv07]. 


• Register-based and machine code GP are essentially similar [LP02]: In
both cases data are stored in (a rather small number of) registers, and 
instructions read data from and write results back to these registers. 
Initially, a program’s inputs are written to registers, and after executing 
the program the results are given in one or more registers. The main 
difference between these two GP approaches is the following: 


— Programs in register-based GP (as also those of any other kind of 
GP system) have to be interpreted, i.e., they are executed indirectly 
or compiled before execution. 


— On the contrary, programs in machine code GP consist of real hard- 
ware machine instructions; thus, these programs can be executed 
directly on a computer. The execution of machine code GP pro- 
grams is therefore a lot faster than the evaluation of programs in 
traditional implementations. 

Nordin’s Compiling Genetic Programming System (CGPS) [Nor97] 
for example presents an implementation of machine code GP. 
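
To make the stack-based variant of linear GP more concrete, the following Python sketch executes a linear list of instructions on a stack. The instruction names are hypothetical, and skipping instructions that do not find enough arguments on the stack is an assumption of this sketch, not a statement about any of the cited systems.

```python
import operator

BINARY_OPS = {"ADD": operator.add, "SUB": operator.sub, "MULT": operator.mul}


def run_stack_program(instructions, inputs):
    """Execute a linear chromosome using a last-in-first-out stack."""
    stack = []
    for instr in instructions:
        if instr in BINARY_OPS:
            if len(stack) < 2:
                continue  # assumption: skip instructions lacking arguments
            right = stack.pop()
            left = stack.pop()
            stack.append(BINARY_OPS[instr](left, right))
        elif instr in inputs:
            stack.append(inputs[instr])   # push the value of an input
        else:
            stack.append(instr)           # push a numeric constant
    # The result is whatever remains on top of the stack.
    return stack[-1] if stack else None


# (X + 5) * (2 * Y) as a linear chromosome of variable length.
program = ["X", 5, "ADD", 2, "Y", "MULT", "MULT"]
result = run_stack_program(program, {"X": 7, "Y": 3})
```

For X = 7 and Y = 3 the program first leaves 12 and then 6 on the stack and finally multiplies them, giving 72. Because any instruction sequence can be executed under the skipping rule, chromosomes can be cut and recombined at arbitrary positions.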


2.2.3.2 Graphical Genetic Programming 


Parallel Distributed Genetic Programming (PDGP, [Pol97], [Pol99b]) is
a form of GP in which programs are represented as graphs whose nodes
represent functions and terminals; links between those nodes define the flow of
control and results. PDGP defines a fixed layout for the nodes whereas the 
connections between them and the referenced functions are evolved by the 
GP process. PDGP enables a high degree of parallelism as well as an efficient 
and effective reuse of partial results; furthermore, it has been shown that it 
performs better than conventional tree-based GP on a number of benchmark 
problems. 

Figure 2.6 shows the graphical representation of an exemplary program in 
PDGP (adapted from [Pol99b]). 


2.3 Basic Steps of the GP-Based Problem Solving 
Process 


2.3.1 Preparatory Steps 


Before the GP process can be started there are several preparatory steps 
that have to be executed. As explained in Section 2.2.1.1, the function and 
terminal sets (F and T, respectively) have to be determined. Furthermore, 
as in any GA application, a fitness measurement function also has to be es- 
tablished so that a solution candidate can be evaluated and its fitness can be 
measured (either explicitly or implicitly). 


FIGURE 2.6: Intron-augmented representation of an exemplary program in
PDGP [Pol99b].


In addition to these preparations that directly affect the construction and 
management of individuals of the GP population, there are also some things 
to be done regarding the execution of the GP algorithm: 


• Parameters that control the GP run have to be set,

• a termination criterion has to be defined, and

• a result designation method has to be defined (as explained later in
Section 2.3.4).


These preparations in fact have to be done for any genetic algorithm; a similar 
summary is for example given in [KIAK99]. Figure 2.7 summarizes the major 
preparatory steps for the basic GP process. 


FIGURE 2.7: Major preparatory steps of the basic GP process.


2.3.2 Initialization


At the beginning of each GA and GP execution, the population is initialized
before the actual evolutionary process can be started. This initialization
can be done either completely at random or using certain (problem-specific)
heuristics.

For hierarchical program structures as used in GP the random initialization
utilizes a maximum initial tree depth $D_{max}$. As introduced in [Koz92b] and
for example reflected on in [ES03], there are two possibilities for creating
random initial programs:


• Full method: Nodes at depth $d < D_{max}$ are randomly chosen functions
from function set $F$, and nodes at depth $d = D_{max}$ are randomly
chosen terminals (from terminal set $T$);


• Grow method: Nodes at depth $d < D_{max}$ become either a function or a
terminal (randomly chosen from $F \cup T$), and nodes at depth $d = D_{max}$
are again randomly chosen terminals (from $T$).


The so-called ramped half-and-half GP initialization method, proposed by
Koza [Koz92b], has meanwhile become one of the most frequently used GP
initialization approaches [ES03]. Both methods, grow and full, are applied
here, each delivering a part of the initial population.
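
The full and grow methods and their combination can be sketched in Python as follows. The function and terminal sets are hypothetical, the root is assumed to be at depth 0, and the choices of alternating strictly between the two methods and of ramping the depth limit from 2 up to the maximum depth are assumptions of this sketch.

```python
import random

FUNCTIONS = {"ADD": 2, "MULT": 2, "NEG": 1}   # hypothetical function set F (arities)
TERMINALS = ["X", "Y", 1.0]                   # hypothetical terminal set T


def full(depth, d_max, rng):
    """Full method: functions at every depth below d_max, terminals at d_max."""
    if depth == d_max:
        return rng.choice(TERMINALS)
    name = rng.choice(sorted(FUNCTIONS))
    return (name,) + tuple(full(depth + 1, d_max, rng)
                           for _ in range(FUNCTIONS[name]))


def grow(depth, d_max, rng):
    """Grow method: below d_max choose from F and T alike, terminals at d_max."""
    if depth == d_max:
        return rng.choice(TERMINALS)
    symbol = rng.choice(sorted(FUNCTIONS) + TERMINALS)
    if symbol not in FUNCTIONS:
        return symbol
    return (symbol,) + tuple(grow(depth + 1, d_max, rng)
                             for _ in range(FUNCTIONS[symbol]))


def ramped_half_and_half(pop_size, d_max, rng):
    """Alternate full and grow over a ramp of depth limits 2..d_max (d_max >= 2)."""
    population = []
    for i in range(pop_size):
        depth_limit = 2 + (i // 2) % (d_max - 1)
        method = full if i % 2 == 0 else grow
        population.append(method(0, depth_limit, rng))
    return population


def tree_depth(tree):
    """Depth of a tree, with a single terminal having depth 0."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(tree_depth(child) for child in tree[1:])


population = ramped_half_and_half(20, 4, random.Random(42))
```

Trees created by the full method always reach the given depth limit exactly, whereas trees created by the grow method may be smaller; mixing both over several depth limits yields an initial population with varied sizes and shapes.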

Still, research on finding optimal initialization techniques is ongoing, as
the use of different initialization strategies can lead to very different
overall results (as for example demonstrated in [HHM04]). For example, there
are approaches that generate initial populations that are adequately
distributed in terms of tree size and distribution within the search space
[GAMRRP07].


2.3.3 Breeding Populations of Programs 


After preparing the GP process and initializing the population, the genetic
process can be started. As is the case in any GA, new individuals (programs)
are created using recombination and mutation, tested, and become a part of
the new population. Fitter individuals have a bigger chance to succeed in
creating children of their own; thus, optimization happens during the run of
the evolutionary algorithm. Unfit programs (and with them also their genetic
material) die out of the population.

As populations cannot grow infinitely in most applications, new programs
somehow have to replace old ones that die off. There are in fact several ways
in which this replacement can be done:


• Generational replacement: The entire population is replaced by its
descendants. This corresponds to generational changes in nature when,
for example, annual plants or animals die in winter whereas their eggs
(hopefully) survive; thus, the next generation of the species is founded.


• Steady state replacement: New individuals are produced continuously,
and the removal of old individuals also happens continuously. Analogies
in nature are obvious as this is more or less how for example human
evolution happens.


• Selection of replaced programs: The individuals removed can be chosen,
for example, from the unfit ones (worst replacement), from the older ones
(replacement with aging), or at random (random replacement).
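
The replacement variants listed above can be sketched as follows. Individuals are modeled as dictionaries with the hypothetical keys "fitness" (higher is better) and "age"; both the data layout and the function names are assumptions made for illustration.

```python
import random


def generational_replacement(population, offspring):
    """The entire population is replaced by its descendants."""
    return list(offspring)


def steady_state_replacement(population, child, strategy, rng):
    """Replace one individual in place by the child; return its index."""
    indices = range(len(population))
    if strategy == "worst":
        # Worst replacement: remove the least fit individual.
        victim = min(indices, key=lambda i: population[i]["fitness"])
    elif strategy == "oldest":
        # Replacement with aging: remove the oldest individual.
        victim = max(indices, key=lambda i: population[i]["age"])
    else:
        # Random replacement.
        victim = rng.randrange(len(population))
    population[victim] = child
    return victim


population = [
    {"fitness": 0.9, "age": 5},
    {"fitness": 0.1, "age": 1},
    {"fitness": 0.5, "age": 9},
]
child = {"fitness": 0.7, "age": 0}
replaced = steady_state_replacement(population, child, "worst", random.Random(0))
```

In this usage example the individual with fitness 0.1 (index 1) is replaced; with the strategy "oldest" the individual of age 9 would be removed instead.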


This whole procedure is graphically displayed in Figure 2.8 (adapted from 
[LP02]). 


FIGURE 2.8: The genetic programming cycle [LP02].


In fact, the whole genetic programming process involves more than what is 
displayed in Figure 2.8: The preparatory steps summarized in Section 2.3.1 
also have to be considered, and of course a validation of the results produced 
has to be done that might lead to a re-formulation of the pre-conditions. A 
more comprehensive overview of the GP process is given in Figure 2.9. 

The execution of the GP cycle is — as GP is an extension to the GA — similar 
to the cyclic execution of the GA: Solutions are selected from the population, 
by crossing them they become parents, mutation is applied with a rather 
small probability, and thus a new offspring is produced. In the generational 
replacement scheme this is repeated until the next generation’s population 
is complete; in the steady state scheme there is no generational cycle but 
this procedure is also repeated over and over again. The whole procedure is 
repeated until some pre-defined termination criterion is met (see Section 2.3.4 
for details). 


FIGURE 2.9: The GP-based problem solving process.


In fact, there is a notable difference in the descriptions of this cyclic
workflow for GAs and for GP regarding the offspring creation scheme applied5:


• In GAs, crossover and mutation are used sequentially, i.e., both are
applied (with mutation having a rather small probability).


• In GP, crossover or mutation (or a simple copy action) are executed
independently; each time a new offspring is to be created, one of these
variants is chosen probabilistically.


In fact, some researchers even recommend the GP-like offspring creation 
schema for all evolutionary computation systems (as for example given by 
Eick, see [Eic07]). 
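
The difference between the two offspring creation schemes can be made explicit in a short Python sketch. The operators are passed in as functions, and the parameter names p_cross and p_mut are hypothetical; the sketch is independent of any particular chromosome representation.

```python
import random


def ga_offspring(parent1, parent2, crossover, mutate, p_mut, rng):
    """GA scheme: crossover is applied, then mutation with probability p_mut."""
    child = crossover(parent1, parent2)
    if rng.random() < p_mut:
        child = mutate(child)
    return child


def gp_offspring(parent1, parent2, crossover, mutate, p_cross, p_mut, rng):
    """GP scheme: exactly one of crossover, mutation, or copy is chosen."""
    r = rng.random()
    if r < p_cross:
        return crossover(parent1, parent2)
    if r < p_cross + p_mut:
        return mutate(parent1)
    return parent1  # simple copy action


# Stub operators that only record which operation was applied.
def demo_crossover(a, b):
    return ("crossed", a, b)


def demo_mutate(a):
    return ("mutated", a)


rng = random.Random(0)
ga_child = ga_offspring("p1", "p2", demo_crossover, demo_mutate, 1.0, rng)
gp_child = gp_offspring("p1", "p2", demo_crossover, demo_mutate, 0.0, 1.0, rng)
```

With these settings the GA-style child has passed through both operators, whereas the GP-style child is the result of mutation alone.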


2.3.4 Process Termination and Results Designation 


In general, the termination criteria of genetic algorithms are also applicable
for genetic programming. A termination criterion might monitor the number
of generations and terminate the algorithm as soon as a given limit is reached.
Problem-specific criteria are also used frequently, i.e., the algorithm is
terminated as soon as a problem-specific success predicate is fulfilled. In
practice, one may monitor the run manually and terminate it when the fitness
values of numerous successive best-of-generation individuals appear to have
reached a plateau [KKS+03b].
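
The three kinds of criteria mentioned here (generation limit, success predicate, and plateau) can be combined in a single check, as in the following sketch. It assumes that fitness is an error value to be minimized and that the success predicate is a simple threshold; the function and parameter names are hypothetical.

```python
def should_terminate(best_history, max_generations,
                     target=None, plateau=None, tol=1e-9):
    """Decide whether to stop a run.

    best_history holds one best-of-generation error value (lower is
    better) per generation executed so far.
    """
    # Criterion 1: the generation limit has been reached.
    if len(best_history) >= max_generations:
        return True
    # Criterion 2: a problem-specific success predicate is fulfilled
    # (here simply: the error is at or below a target value).
    if target is not None and best_history and best_history[-1] <= target:
        return True
    # Criterion 3: no improvement during the last `plateau` generations.
    if plateau is not None and len(best_history) > plateau:
        recent = best_history[-(plateau + 1):]
        return recent[0] - min(recent[1:]) <= tol
    return False
```

An automated plateau check of this kind is only a stand-in for the manual monitoring described above; suitable values for plateau and tol depend on the problem at hand.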


5 The GA workflow was described in detail in Chapter 1; the GP workflow as it is summarized
here is also described in further detail in [Koz92b], [KKS+03b], and [ES03], for example.


FIGURE 2.10: GA and GP flowcharts: The conventional genetic algorithm
and genetic programming.


After the algorithm has terminated, the result to be returned has to be 
designated. Normally, the single best-so-far individual is 
harvested and designated as the result of the run [KKS+03b]. As we will 
see in Chapter 11, there are applications (as for example data-based structure 
identification) in which this is not the optimal strategy. In this case the use of 
a validation data set V is suggested, i.e., a data collection that was not used 
during the GP training phase; we eventually test the programs on V and pick 
the one that performs best on V. 


2.4 Typical Applications of Genetic Programming 


As genetic programming is a domain-independent method, there is an enor- 
mous number of applications for which it has been used for automatically pro- 
ducing solutions of high quality. Here we give a very short summary of exem- 
plary problem classes which have been used for demonstrating GP’s power in 
automatically learning programs for solving problems for more than 15 years, 
namely the automated learning of multiplexer functions (Section 2.4.1), the 
artificial ant (2.4.2), and symbolic regression (2.4.3). Finally, in Section 2.4.4 
we give a short list of various problems for which GP has proven to be able 
to produce high quality results. 


2.4.1 Automated Learning of Multiplexer Functions 


The automated learning of functions requires the development of composi- 
tions of functions that can return correct values of functions after seeing only 
a relatively small number of specific examples; these training samples are com- 
binations of values of the function associated with particular combinations of 
arguments. 

The problem of learning Boolean multiplexer functions has become famous 
as a benchmark application for genetic programming since Koza’s work on it, 
presented for example in [Koz89] and [Koz92b]. The input to a Boolean 
k-multiplexer function is a bit string consisting of k address bits $a_i$ and 
$2^k$ data bits $d_j$; normally, the bits are aligned in the form 
$[a_{k-1} \ldots a_1 a_0 \, d_{2^k-1} \ldots d_1 d_0]$. The value returned by the multiplexer function 
is the value of the particular data bit that is addressed by the k address bits. 
For example, let k be 3 and the three address bits $a_2 a_1 a_0 = 101$; then the 
multiplexer singles out data bit $d_5$ to be its output.⁶ The abstract black box 
model of the Boolean multiplexer with three address bits and $2^3 = 8$ data bits 
as well as the concrete addressing of data bit $d_5$ is displayed in Figure 2.11. 

A solution to this problem obviously has to be a function that uses the input 
information a and d and calculates a Boolean return value. Thus, the terminal 
set has $(k + 2^k)$ elements which correspond to the inputs to the multiplexer; in 
the case of k = 3 the terminal set is T = {A0, A1, A2, D0, D1, ..., D7}. The 
function set contains Boolean functions and the conditional function, i.e., 
F = {AND, OR, NOT, IF}. The evaluation of a solution candidate is done 
by applying the formula to all possible input bit combinations and counting 
the number of correct output values. As there are $(k + 2^k)$ inputs to the 
Boolean multiplexer, the number of possible input combinations is $2^{k + 2^k}$; 
in the case of k = 3, the number of possible input combinations is $2^{11} = 2048$. 
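
The evaluation scheme described above can be sketched in a few lines of Python. The representation of a candidate as a Python callable taking the address and data bits, as well as the names `multiplexer`, `fitness`, and `always_d0`, are assumptions made for this illustration.

```python
from itertools import product

K = 3  # number of address bits


def multiplexer(address_bits, data_bits):
    """Target function: address_bits = (a_{k-1}, ..., a_0), data_bits = (d_{2^k-1}, ..., d_0)."""
    index = int("".join(map(str, address_bits)), 2)
    return data_bits[len(data_bits) - 1 - index]


def fitness(candidate, k=K):
    """Number of input combinations for which the candidate returns the correct bit."""
    correct = 0
    for bits in product((0, 1), repeat=k + 2 ** k):
        a, d = bits[:k], bits[k:]
        if candidate(a, d) == multiplexer(a, d):
            correct += 1
    return correct


def always_d0(a, d):
    """A deliberately poor candidate that ignores the address bits."""
    return d[-1]


if __name__ == "__main__":
    print(fitness(multiplexer))  # 2048: every input combination is correct
    print(fitness(always_d0))    # 1152 of the 2048 input combinations
```

The poor candidate is right for all 256 inputs with address 000 and, by coincidence, for half of the 256 inputs of each of the other seven addresses, which gives 256 + 7 · 128 = 1152.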


⁶ Data bit $d_5$ is in fact the sixth data bit: since $a_2 a_1 a_0 = 000$ addresses data bit $d_0$, 
the indices of the data bits are zero-based. 

FIGURE 2.11: The Boolean multiplexer with three address bits; (a) general 
black box model, (b) addressing data bit $d_5$. 


Koza was able to show that GP is able to solve the 3-address multiplexer 
problem 100% correctly [Koz92b]; this optimal result is shown in Figure 2.12. 
Numerous publications also document test series in which GP was used 
for solving multiplexer problems with more address bits. 


(IF A0 (IF A2 (IF A1 D7 (IF A0 D5 D0)) (IF A0 (IF A1 (IF A2 D7 D3) D1) D0)) (IF 
A2 (IF A1 D6 D4) (IF A2 D4 (IF A1 D2 (IF A2 D7 D0))))) 


FIGURE 2.12: A correct solution to the 3-address Boolean multiplexer prob- 
lem [Koz92b]. 
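
The correctness of this expression can be checked by brute force. The following sketch parses the S-expression of Figure 2.12, written with the terminals A0 to A2 and D0 to D7, and counts the input combinations it handles correctly; the parser and the helper names are assumptions made for this illustration.

```python
from itertools import product

SOLUTION = ("(IF A0 (IF A2 (IF A1 D7 (IF A0 D5 D0)) (IF A0 (IF A1 (IF A2 D7 D3) D1) D0)) "
            "(IF A2 (IF A1 D6 D4) (IF A2 D4 (IF A1 D2 (IF A2 D7 D0)))))")


def parse(text):
    """Turn an S-expression into nested tuples, e.g. ('IF', 'A0', 'D1', 'D0')."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        if tokens[pos] == "(":
            node, pos = [], pos + 1
            while tokens[pos] != ")":
                child, pos = read(pos)
                node.append(child)
            return tuple(node), pos + 1
        return tokens[pos], pos + 1

    tree, _ = read(0)
    return tree


def evaluate(node, env):
    """Evaluate a tree over F = {AND, OR, NOT, IF} for the bit values in env."""
    if isinstance(node, str):
        return env[node]
    op, *args = node
    if op == "IF":
        cond, then, other = args
        return evaluate(then if evaluate(cond, env) else other, env)
    if op == "NOT":
        return 1 - evaluate(args[0], env)
    if op == "AND":
        return int(all(evaluate(a, env) for a in args))
    if op == "OR":
        return int(any(evaluate(a, env) for a in args))
    raise ValueError(f"unknown function {op}")


def count_correct(tree, k=3):
    """Count the input combinations for which the tree returns the addressed data bit."""
    names = [f"A{i}" for i in reversed(range(k))] + [f"D{j}" for j in reversed(range(2 ** k))]
    correct = 0
    for bits in product((0, 1), repeat=len(names)):
        env = dict(zip(names, bits))
        address = sum(env[f"A{i}"] << i for i in range(k))
        correct += evaluate(tree, env) == env[f"D{address}"]
    return correct


if __name__ == "__main__":
    print(count_correct(parse(SOLUTION)))  # 2048, i.e., 100% correct
```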


2.4.2 The Artificial Ant 


The artificial ant problem ([CJ91a], [CJ91b], [JCC+92]) has also been a 
frequently used benchmark problem for GP since Koza’s application [Koz92b]; 
meanwhile, it has become a well-studied problem in the GP community (see 
for example [LW95], [IIS98], [Kus98], [LP98], and [LP02]). 

In short, the problem is to navigate an artificial ant on a grid consisting of 
32 × 32 cells. The grid is toroidal so that if the ant moves off the edge of the 
grid, it reappears and continues on the opposite edge. On this grid, “food” 
units are distributed (normally along a trail); each time the ant enters a square 
containing food, the ant eats it. At the beginning of the ant’s wanderings it 
starts at cell (0,0) facing in a particular direction (east, e.g.); at each time 
step, the ant is able to move forward in the direction it is facing, to turn right, 
or to turn left. The goal is to find a program that is able to navigate the ant 
so that as many food items as possible are eaten in a certain number of time 
units. The program can use the following: 


• Three operations are available, namely Move, Left, and Right, which let 
the ant move ahead, turn left, or turn right, respectively; these 
operations are used as terminals in the GP process. 


• The sensing function IfFoodAhead investigates the cell the ant is 
currently facing and then executes the first child operation if food is ahead 
or the second child operation otherwise. 


• Additionally, two more functions are available: Prog2 and Prog3 take 
two and three arguments (operations), respectively, which are executed 
consecutively. 

FIGURE 2.13: The Santa Fe trail. 


The most frequently used trail is the so-called “Santa Fe trail” designed 
by Christopher Langton. This trail is displayed in Figure 2.13 (adapted from 
[LP02]); the ant is allowed to wander around the map for 600 time units. This 
problem is in fact considered a hard problem for GP; thorough explanations 
for this statement are given, for example, by Langdon and Poli in “Why ants 
are hard” ([LP98] and [LP02]). What makes it so hard is not that it is difficult 
to find correct solutions, but rather that it is difficult to find them efficiently, 
i.e., significantly better than random search does. As is listed in [LP98], the 
smallest programs that solve the Santa Fe trail problem (i.e., those that let 
the ant eat all food packets) are of length eleven⁷; one of them is 
shown as an example in Figure 2.14. 
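
The terminals and functions listed above can be sketched as a small interpreter. The sketch makes several assumptions that are not stated in the text: programs are nested tuples, the program is restarted whenever it has finished, each Move, Left, or Right costs one time unit while IfFoodAhead, Prog2, and Prog3 are free, and the y axis points downward. The short L-shaped trail in the usage example is hypothetical and is not the Santa Fe trail.

```python
GRID = 32
HEADINGS = [(1, 0), (0, 1), (-1, 0), (0, -1)]  # east, south, west, north


class Ant:
    def __init__(self, food, max_steps=600):
        self.food = set(food)          # cells (x, y) that still contain food
        self.x = self.y = 0            # start at cell (0, 0)
        self.heading = 0               # facing east
        self.steps_left = max_steps
        self.eaten = 0

    def ahead(self):
        dx, dy = HEADINGS[self.heading]
        return ((self.x + dx) % GRID, (self.y + dy) % GRID)  # toroidal grid

    def run(self, node):
        if self.steps_left <= 0:
            return
        op = node[0]
        if op == "Move":
            self.steps_left -= 1
            self.x, self.y = self.ahead()
            if (self.x, self.y) in self.food:
                self.food.remove((self.x, self.y))
                self.eaten += 1
        elif op == "Left":
            self.steps_left -= 1
            self.heading = (self.heading - 1) % 4
        elif op == "Right":
            self.steps_left -= 1
            self.heading = (self.heading + 1) % 4
        elif op == "IfFoodAhead":
            self.run(node[1] if self.ahead() in self.food else node[2])
        else:  # Prog2 or Prog3: execute the children consecutively
            for child in node[1:]:
                self.run(child)


def simulate(program, food, max_steps=600):
    """Run the program repeatedly until time is up or all food is eaten."""
    ant = Ant(food, max_steps)
    while ant.steps_left > 0 and ant.food:
        ant.run(program)
    return ant.eaten


if __name__ == "__main__":
    trail = {(1, 0), (2, 0), (3, 0), (3, 1), (3, 2)}  # hypothetical mini trail
    program = ("IfFoodAhead", ("Move",), ("Right",))
    print(simulate(program, trail))  # 5: all food items are eaten
```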


FIGURE 2.14: A Santa Fe trail solution. The black points represent nodes 
referencing the Prog3 function. 


Even though it is a very “simple” problem, the artificial ant problem still 
provides a good basis for many theoretical investigations in GP such as building 
blocks and schema analysis [LP02], discussions of operators ([LS97] or [S98], 
e.g.), further algorithmic development [CO07], and many other research 
activities. 


2.4.3 Symbolic Regression 


In short, symbolic regression is the induction of mathematical expressions 
on data. The key feature of this technique is, as Keijzer summarized in [Kei02], 
that the object of search is a symbolic description of a model, not just a set 
of coefficients in a pre-specified model. This is in sharp contrast with other 
methods of regression, including linear regression, polynomial approaches, or 
also artificial neural networks (ANNs), where a specific model is assumed and 
often only the complexity of this model can be varied. 

⁷ In fact, there are 2,554,416 possible programs with length 11, but only 12 (i.e., 0.00047%) 
of them are successes. For programs of length 14 this ratio is approximately 0.0007%, for 
bigger program sizes (up to 200-500) it levels off between 0.0001% and 0.0002% [LP98]. 

The main goal of regression in general is to determine the relationship 
of a dependent (target) variable t to a set of specified independent (input) 
variables x. Thus, what we want to get is a function f that uses x and a set 
of coefficients w such that 

$$t = f(x, w) + \epsilon \tag{2.1}$$

where $\epsilon$ represents the error (noise) term. 

The form of f is usually pre-defined in standard regression techniques, as 
for example in linear regression ($f_{LinReg}$) and ANNs ($f_{ANN}$): 

$$f_{LinReg}(x, w) = w_0 + w_1 x_1 + \ldots + w_n x_n \tag{2.2}$$
$$f_{ANN}(x, w) = w_o \cdot g(w_i x) \tag{2.3}$$

In linear regression, w is the set of coefficients $w_0, w_1, \ldots, w_n$. In ANNs 
we usually use an auxiliary transfer function g (which normally is a sigmoid 
function, as for example the logistic function $\frac{1}{1+e^{-x}}$); the coefficients w are here 
called weights and include the weights from the hidden nodes to the output 
layer ($w_o$) and those from the input nodes to the hidden nodes ($w_i$) [Kei02]. 

In contrast to this, the function f which is searched for is not of any 
pre-specified form when applying genetic programming to symbolic regression. 
Instead, low-level functions are used and combined into more complex 
formulas during the GP process. Given a set of functions $f_1, \ldots, f_u$, the overall 
functional form induced by genetic programming can take a variety of forms. 
Usually, standard arithmetical functions such as addition, subtraction, 
multiplication, and division are in the set of functions f, but also trigonometric, 
logical, and more complex functions could be included. 

An exemplary composed function therefore could be: 


$$f(x, w) = f_1(f_4(x_1), f_3(x_3, w_1), f_4(f_2(x_1, w_2)), x_2)$$


or, by filling in some concrete functions for the abstract symbols f and w, we 
could get: 

$$f_1(x) = +(*(0.5, x), 1) = 0.5 * x + 1$$

$$f_2(x) = +(2, *(x, x)) = 2 + x * x$$

When it comes to evaluating solution candidates in a GP-based symbolic 
regression algorithm, the formulas have to be evaluated on a certain set of 
evaluation data X, yielding the estimated values E. These estimated values 
are then compared to the original values T, i.e., those which are known from 
data retrieval (experiments) or calculated by applying the original formula to 
X. 

For example, let $f_{target}$ be the target function 

$$f_{target}(x) = -(*(0.5, *(x, x)), 2) = 0.5 * x^2 - 2 \tag{2.4}$$

and the functions $f_1$ and $f_2$ solution candidates. Furthermore, let the input 
data X be 

$$X = [-5, -4, \ldots, +4, +5]. \tag{2.5}$$

Thus, by evaluating $f_{target}$, $f_1$, and $f_2$ on X we get T, $E_1$, and $E_2$: 

$$T = [10.5, 6, 2.5, 0, -1.5, -2, -1.5, 0, 2.5, 6, 10.5] \tag{2.6}$$
$$E_1 = [-1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5] \tag{2.7}$$
$$E_2 = [27, 18, 11, 6, 3, 2, 3, 6, 11, 18, 27] \tag{2.8}$$


By crossing $f_1$ and $f_2$, these become parent functions (parent formulas 1 and 
2) and we could for example get the child formula $f_3$: 

$$f_3(x) = +(*(0.5, *(x, x)), 1) = 0.5 * x * x + 1 \tag{2.9}$$

and by evaluating it on X we get $E_3$: 

$$E_3 = [13.5, 9, 5.5, 3, 1.5, 1, 1.5, 3, 5.5, 9, 13.5] \tag{2.10}$$

Graphical displays of the formulas $f_1$, $f_2$, and $f_3$ (labeled as parent and 
child functions) and their evaluations are given in Figures 2.16 and 2.15, 
respectively. 
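
The crossover step of this example can be reproduced with formulas stored as nested tuples in prefix form. This is a minimal sketch: the tuple representation, the addressing of subtrees by index paths, and the helper names are assumptions made for the illustration.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(node, x):
    """Evaluate a formula tree such as ('+', ('*', 0.5, 'x'), 1) at the point x."""
    if isinstance(node, tuple):
        op, left, right = node
        return OPS[op](evaluate(left, x), evaluate(right, x))
    return x if node == "x" else node  # variable or numeric constant


def subtree(node, path):
    """Return the subtree reached by following the child indices in path."""
    for i in path:
        node = node[i]
    return node


def replace(node, path, new):
    """Return a copy of node in which the subtree at path is replaced by new."""
    if not path:
        return new
    i = path[0]
    return node[:i] + (replace(node[i], path[1:], new),) + node[i + 1:]


f1 = ("+", ("*", 0.5, "x"), 1)      # 0.5 * x + 1
f2 = ("+", 2, ("*", "x", "x"))      # 2 + x * x

# Crossover: the terminal x in f1 is replaced by the subtree *(x, x) taken from f2.
f3 = replace(f1, (1, 2), subtree(f2, (2,)))

if __name__ == "__main__":
    X = range(-5, 6)
    print(f3)                            # ('+', ('*', 0.5, ('*', 'x', 'x')), 1)
    print([evaluate(f3, x) for x in X])  # starts with 13.5, 9.0, 5.5, ...
```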


FIGURE 2.15: A symbolic regression example (plotted curves: target function, 
parent formulas (I) and (II), and child formula). 


FIGURE 2.16: Exemplary formulas. 


The task of GP in symbolic regression thus is to find a composition of the 
functions, input variables, and coefficients that minimizes the error of the 
function with respect to the desired target values. There are several ways 
to measure this error; one of the simplest and probably most frequently 
used ones is the mean squared error (mse) function. The mean squared 
error of the vectors A and B, each containing n values, is calculated as 

$$mse(A, B) = \frac{1}{n} \sum_{k=1}^{n} (A_k - B_k)^2; \quad |A| = |B| = n \tag{2.11}$$


So, we can calculate the fitness of $f_1$, $f_2$, and $f_3$ as $mse(E_1, T)$, $mse(E_2, T)$, 
and $mse(E_3, T)$, respectively, yielding 

$$fitness(f_1) = 26.0 \tag{2.12}$$
$$fitness(f_2) = 100.5 \tag{2.13}$$
$$fitness(f_3) = 9.0 \tag{2.14}$$
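
These fitness values can be recomputed directly from the definitions. The following sketch writes the target function and the three candidates as plain Python expressions (rather than as GP trees) and applies the mean squared error; the variable names are chosen for this illustration only.

```python
def mse(a, b):
    """Mean squared error of two equally long sequences."""
    if len(a) != len(b):
        raise ValueError("both vectors must contain the same number of values")
    return sum((ak - bk) ** 2 for ak, bk in zip(a, b)) / len(a)


X = list(range(-5, 6))                 # [-5, -4, ..., +4, +5]
T = [0.5 * x * x - 2 for x in X]       # target function
E1 = [0.5 * x + 1 for x in X]          # parent formula 1
E2 = [2 + x * x for x in X]            # parent formula 2
E3 = [0.5 * x * x + 1 for x in X]      # child formula

if __name__ == "__main__":
    for name, estimates in (("f1", E1), ("f2", E2), ("f3", E3)):
        print(name, mse(estimates, T))  # 26.0, 100.5, and 9.0
```

The child formula differs from the target by the constant 3 at every point, which is why its mean squared error is exactly 9.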


Whereas the search for formulas that minimize a given error function (or 
maximize some other given fitness function) is the major goal of GP-based 
regression, the shape and the size of the solution could also be integrated into 
the fitness estimation function. The number and values of the coefficients used is 
another issue that is tackled in the optimization process; the search process is 
also free to consider certain input variables or not, and thus it is able 
to perform variable selection (possibly leading to dimensionality reduction) 
[Kei02]. 


2.4.4 Other GP Applications 


Finally, we give a short list of problems for which GP has proven 
to be able to produce high quality results; this list of course comes without 
any claim of completeness. 

Koza can certainly be seen as one of the pioneers of applying GP to a 
variety of different problems: In [Koz92b], [Koz94], [KIAK99], and [KKS+03b] 
he reports (together with co-authors) on the GP-based solving of problems 
for example in classification, regression, pattern recognition, computational 
molecular biology, emergent behavior, cellular automata, sorting networks, 
design of topology and component sizing for complex hardware structures 
(such as analog electrical circuits, controllers, and antennas), and many others. 
Many of those results can be considered human-competitive, some even 
being patentable new inventions created by GP. 

In hardware design, for example, one of the problem situations explained in 
[KIAK99] is the automated design of amplifiers. In general, an amplifier is a 
circuit with one input and one output which multiplies the voltage of its input 
signal by a certain factor (the so-called voltage amplification factor) over a 
specified range of frequencies. The goal then is to realize such an amplifier 
using only resistors, capacitors, inductors, transistors, and power sources; the 
function set used thus includes component-creating functions for creating 
digital gates, inductors, transistors, power supplies, and resistors. Solution 
candidates are tree structures representing complete hardware entities, which 
can be displayed in the form we are used to. 

Of course, there is a vast number of other fields of application for genetic 
programming. Numerous applications of GP to problems of practical and 
scientific importance have for example also been documented in the 
conference proceedings of the GECCO, CEC, or EuroGP conferences ([CP+03a], 
[CP+03b], [D+04a], [D+04b], [B05], [K+06], [T+07], [JKOL+04], [KTC+05], 
[CTE+06], or [E+07], e.g.). Please see the GP bibliography (Section 2.8) for 
a short list of sources of publications on those. 


2.5 GP Schema Theories 


As we have summarized how genetic programming works, we shall now 
turn our minds towards investigations why it works so well. Holland’s work 
in the mid-1970s produced the well-known GA schema theorem; schemata 
have since then been frequently used to demonstrate how and why GAs work. 
In fact, as is summarized in [PMR04], in the 1990s interest in GA theory 
shifted towards exact microscopic Markov chain models possibly with aggre- 
gated states. However, after the work of Stephens and collaborators in the 
late 1990s on exact schema theories based on the notion of dynamic building 
blocks and the connection highlighted by Vose between his model and a dif- 
ferent type of exact schema-based model, it is now clear that Markov-chain 
and schema-based models are, when exact, just different representations of 
the same thing. 

Genetic programming theory has had a “difficult childhood,” as Poli et 
al. stated in [PMR04]: After some early works on approximate GP schema 
theorems, it took quite some time until schema theories could be developed 
that give exact formulations for expected frequencies of schemata at the next 
generation. 

In this section we give a rough overview of these GP schema theorems: 
In Section 2.5.1 we summarize early work on GP schema theories, which 
sees schemata as components of programs; then we give an introduction to rooted 
tree GP schema theories (Section 2.5.2) and an exact GP schema theory 
(Section 2.5.3). Finally, in Section 2.5.4 we summarize the GP schema theory 
concept. 

The classification of schemata given in this section follows the grand con- 
cepts of [LP02], Chapters 3-6. 


2.5.1 Program Component GP Schemata 


First attempts to explain why GP works were given by Koza; in short, he 
gave an informal argument showing that Holland’s schema theorem would 
also work for GP, as described in [Koz92b], pp. 116-119. In Koza’s definition, 
a schema is defined as a set of program subtrees (S-expressions); a schema 
can thus be used for defining a subspace of the program tree search space by 
collecting all programs that include all subtrees given by the schema. For 
example, the schema H=[(+ x 3), y] includes the programs *(y,+(x,3)) and 
*(+(y,3),+(2,+(x,3))) as they both include (at least) one occurrence of the 
S-expressions (+ x 3) and y. This example is displayed graphically in 
Figure 2.17. 
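
Membership in such a schema can be tested by collecting all subtrees of a program. The following sketch assumes that programs are stored as nested tuples in prefix form; the function names are chosen for this illustration.

```python
def subtrees(tree):
    """Yield the tree itself and every subtree (including single terminals)."""
    yield tree
    if isinstance(tree, tuple):
        for child in tree[1:]:
            yield from subtrees(child)


def matches_koza_schema(program, schema):
    """True if every S-expression of the schema occurs at least once in the program."""
    found = list(subtrees(program))
    return all(expression in found for expression in schema)


H = [("+", "x", 3), "y"]
p1 = ("*", "y", ("+", "x", 3))                      # *(y, +(x, 3))
p2 = ("*", ("+", "y", 3), ("+", 2, ("+", "x", 3)))  # *(+(y, 3), +(2, +(x, 3)))

if __name__ == "__main__":
    print(matches_koza_schema(p1, H), matches_koza_schema(p2, H))  # True True
    print(matches_koza_schema(("+", "x", "y"), H))                 # False
```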


FIGURE 2.17: Programs matching Koza’s schema H=[(+ x 3), y]. 


The first probabilistic model of GP that can be considered a mathematical 
formulation of a schema theorem for GP [LP02] was given by Altenberg in 
[Alt94a]. Assuming very large populations, neglecting mutation, 
and applying proportional selection, he was able to calculate the 
frequency of a program at the next generation. Altenberg used a schema 
concept in which schemata are subexpressions and not, as in Koza’s work, 
collections of subexpressions. 

O’Reilly formalized Koza’s work on schemata ([O’R95], [OO94]) and derived 
a schema theorem for GP with proportional selection and crossover (but 
without mutation). The main difference from Koza’s approach was that she defined 
schemata as collections of subtrees and tree fragments; tree fragments in this 
context are trees with at least one leaf being a “don’t care” symbol (‘#’). 
O’Reilly was also able to calculate the frequency of a program at the next 
generation; unfortunately, the frequency depends on the shape, the size, and 
the composition of the trees containing the schemata investigated. Thus, fre- 
quencies are given rather as lower bounds than as concrete values. As O’Reilly 
argued in the discussion of her result, no hypotheses can be made on the basis 
of this theorem regarding the real propagation and the use of building blocks 
in GP. 

Another approach was investigated by Whigham: He produced a definition 
of schemata for context free grammars and the related schema theorem, which 
was published for example in [Whi95], [Whi96b], and [Whi96a]. Based on his 
definition of schemata he was able to give equations for the probabilities of 
disruption of schemata by crossover and mutation. As in O’Reilly’s work, 
Whigham’s theorem describes the propagation of the components of schemata 
from one generation to the next. 

In all these early attempts GP schemata were used for modeling how com- 
ponents (or groups of components) propagate within the population and how 
the number of these instances can vary over time. 


2.5.2 Rooted Tree GP Schema Theories 


In rooted tree GP schema theory, a schema can be seen as a set of points 
of the search space that share some syntactic feature. This can be defined 
in the following way [PMR04]: Let F be the set of functions used, and T 
the set of terminals. Syntactically, a GP schema is then defined as a tree 
composed of functions from the set F U {=} and terminals from T U {=}; 
the primitive = here means “don’t care” and stands for a single terminal or 
function. Semantically, H is the set of programs that have the same shape 
and the same labels for the non-“=” nodes as the tree representation of H. 

A simple example is given in Figure 2.18: Let F be defined as F = {+, -, *} 
and T as T = {x, y, z}, and the schema H given as *(=, =(x, =)). For 
example, the programs *(y, *(x, x)), *(z, +(x, z)), and *(x, -(x, z)) are 
members of H, i.e., they are included in H’s semantics. 

Rosca proposed this kind of schemata in [Ros97] (using the symbol ‘#’ 
instead of ‘=’). He formulated his schema theorem so that it became possible 
to calculate a lower bound for a schema’s frequency at the next generation. 
As a matter of fact, here also schemata divide the space of programs into 
subspaces containing programs of different sizes and shapes. 

Contrary to this, the following fixed-size-and-shape theory for GP was 
developed by Poli and Langdon ([PL97c], [PL97a]): 

FIGURE 2.18: The rooted tree GP schema *(=, =(x, =)) and three exemplary 
programs of the schema’s semantics. 


Under the assumption that fitness proportional selection is applied, the 
probability of a program h sampling the schema H to be selected is 

$$\Pr\{h \in H\} = \frac{m(H,t)\, f(H,t)}{M \bar{f}(t)} \tag{2.15}$$

where m(H,t) denotes the number of programs matching the schema H at 
generation t, f(H,t) the mean fitness of programs matching H, M the 
population size, and $\bar{f}(t)$ the mean fitness of the programs in the population. 

The main idea is that the probability of the disruption of a schema can 
be estimated. Let $D_c(H)$ be the event “H is disrupted when a program h 
matching H is crossed over with a program $\hat{h}$”; as is described in full detail in 
[LP02], the probability of such a disruption caused by one-point crossover 
can be formulated as 

$$\Pr\{D_c(H)\} \le p_{diff}(t) \left(1 - \frac{m(G(H),t)\, f(G(H),t)}{M \bar{f}(t)}\right) + \frac{\mathcal{L}(H)}{N(H)-1} \cdot \frac{m(G(H),t)\, f(G(H),t) - m(H,t)\, f(H,t)}{M \bar{f}(t)} \tag{2.16}$$

where G(H) is the shape of all programs matching the schema H (which 
is called the hyperspace of H), $\mathcal{L}(H)$ the defining length of H, and N(H) 
the number of nodes in H; $p_{diff}$ is 
the probability of the disruption of schema H by crossing h (matching H) 
with a program $\hat{h}$ that has a different shape than h, i.e., which is not in G(H): 
$p_{diff}(t) = \Pr\{D_c(H) \mid \hat{h} \notin G(H)\}$. 

When it comes to point mutation, a schema H will survive mutation only 
if none of its O(H) defining nodes is modified. Thus, the probability of H 
being disrupted by mutation, $\Pr\{D_m(H)\}$, depends on the probability of 
a node to be altered ($p_m$): 

$$\Pr\{D_m(H)\} = 1 - (1 - p_m)^{O(H)} \tag{2.17}$$


The overall formula uses these partial results and finally gives the expected 
number of programs matching schema H at generation t + 1: 

$$E[m(H,t+1)] \ge M \Pr\{h \in H\}\,(1 - \Pr\{D_m(H)\})\,(1 - p_{xo} \Pr\{D_c(H)\}) \tag{2.18}$$

By substituting (2.15), (2.16), and (2.17) in (2.18) we get the final overall 
formula for the lower bound of individuals sampling H at generation t + 1 in 
generational GP with fitness proportional selection, one-point crossover, and 
point mutation as it is given in [LP02]. 
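
The way the partial results (2.15) to (2.17) enter the bound (2.18) can be traced with a small numerical sketch. All numbers in the usage example are hypothetical and serve only to show the order of the calculations; the function and parameter names are likewise assumptions.

```python
def selection_probability(m, f, pop_size, f_mean):
    """(2.15): probability of selecting a program that matches the schema."""
    return m * f / (pop_size * f_mean)


def crossover_disruption_bound(p_diff, p_shape, p_schema, defining_length, n_nodes):
    """(2.16): upper bound for the disruption probability under one-point crossover."""
    return p_diff * (1 - p_shape) + defining_length / (n_nodes - 1) * (p_shape - p_schema)


def mutation_disruption(p_m, order):
    """(2.17): probability that at least one defining node is altered."""
    return 1 - (1 - p_m) ** order


def expected_lower_bound(pop_size, p_schema, pr_dm, pr_dc, p_xo):
    """(2.18): lower bound for the expected number of matching programs at t + 1."""
    return pop_size * p_schema * (1 - pr_dm) * (1 - p_xo * pr_dc)


if __name__ == "__main__":
    M, f_mean = 100, 1.0                                  # hypothetical population
    p_H = selection_probability(10, 1.5, M, f_mean)       # schema H: 10 programs, mean fitness 1.5
    p_G = selection_probability(40, 1.2, M, f_mean)       # shape G(H): 40 programs, mean fitness 1.2
    pr_dc = crossover_disruption_bound(1.0, p_G, p_H, defining_length=2, n_nodes=5)
    pr_dm = mutation_disruption(p_m=0.01, order=3)
    print(expected_lower_bound(M, p_H, pr_dm, pr_dc, p_xo=0.9))
```

Setting `p_diff` to 1.0, as in the usage example, corresponds to the most pessimistic assumption that every crossover with a differently shaped program disrupts the schema.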

This GP schema theorem, produced by generalizing Holland’s GA schema 
theorem, thus gives a pessimistic lower bound for the expected number of 
copies of a schema in the next generation. In the next section we will 
summarize an exact GP schema theory, produced by generalizing an exact GA 
schema theorem and using the concept of hyperschemata. 


2.5.3 Exact GP Schema Theory 


In the previous section we have summarized pessimistic GP schema theory 
based on a generalization of Holland’s GA schema theorem. As Langdon and 
Poli summarize in [LP02], the usefulness of these schema theorems has been 
widely criticized (see [CP94], [Alt94b], [FG97], [FG98], or [Vos99], e.g.). In 
order to overcome their main drawbacks, namely that they are pessimistic and 
only give lower bounds for the expected numbers of instances of a given 
schema at the next generation, more exact schema theorems for GAs and GP 
had to be developed. These are summarized in this section: After 
explaining the main idea of Stephens and Waelbroeck’s GA schema theory, 
we summarize the hyperschema concept and finally, on the basis of these 
hyperschemata, exact GP schema theorems. 

An exact GA schema theorem was developed at the end of the last 
millennium ([SW97], [SW99]): The total transmission probability $\alpha$ of a schema 
H is defined so that $\alpha(H, t)$ is the probability that at generation t the 
individuals of the GA’s population will match H. Assuming a crossover probability 
$p_{xo}$, $\alpha(H, t)$ is calculated as: 

$$\alpha(H,t) = (1 - p_{xo})\, p(H,t) + \frac{p_{xo}}{N-1} \sum_{i=1}^{N-1} p(L(H,i),t)\, p(R(H,i),t) \tag{2.19}$$

with L(H,i) and R(H,i) being the left and right parts of schema H, 
respectively, and p(H,t) the probability of selecting an individual matching H to 
become a parent. The “left” part of a schema H is thereby produced by 
replacing all elements of H at the positions from i + 1 to N with 
“don’t care” symbols (with N being the length of the bit strings); the “right” 
part of a schema H is produced by replacing all elements of H from position 
1 to i with “don’t care” symbols. The summation runs over all positions from 1 to 
N - 1, i.e., over all possible crossover points. A generalization of this theorem 
to variable-length GAs has also been constructed [SPWR02]. 
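
For a small population, (2.19) can be compared with an explicit enumeration of all parent pairs and crossover points. The sketch below makes several assumptions: schemata are strings over 0, 1, and `*` (the don't care symbol chosen here), the left part keeps positions 1 to i and the right part positions i + 1 to N, selection is fitness proportional, and crossover combines the first i bits of the first parent with the remaining bits of the second one. The toy fitness function and the population are hypothetical.

```python
from itertools import product


def matches(individual, schema):
    return all(s == "*" or s == b for b, s in zip(individual, schema))


def left(schema, i):
    """L(H, i): positions i + 1, ..., N are replaced by don't care symbols."""
    return schema[:i] + "*" * (len(schema) - i)


def right(schema, i):
    """R(H, i): positions 1, ..., i are replaced by don't care symbols."""
    return "*" * i + schema[i:]


def p(schema, population, fitness):
    """Probability of selecting a parent that matches the schema."""
    total = sum(fitness(ind) for ind in population)
    return sum(fitness(ind) for ind in population if matches(ind, schema)) / total


def alpha(schema, population, fitness, p_xo):
    """Total transmission probability according to (2.19)."""
    n = len(schema)
    xo = sum(p(left(schema, i), population, fitness) * p(right(schema, i), population, fitness)
             for i in range(1, n))
    return (1 - p_xo) * p(schema, population, fitness) + p_xo / (n - 1) * xo


def alpha_brute_force(schema, population, fitness, p_xo):
    """Enumerate all parent pairs and crossover points explicitly."""
    n = len(schema)
    total = sum(fitness(ind) for ind in population)
    result = 0.0
    for a, b in product(population, repeat=2):
        pair = fitness(a) * fitness(b) / total ** 2
        result += pair * (1 - p_xo) * matches(a, schema)        # no crossover: copy of a
        for i in range(1, n):
            child = a[:i] + b[i:]                               # one-point crossover at i
            result += pair * p_xo / (n - 1) * matches(child, schema)
    return result


if __name__ == "__main__":
    population = ["1101", "1001", "0111", "0000", "1110"]
    ones_plus_one = lambda ind: ind.count("1") + 1               # toy fitness
    print(alpha("1**1", population, ones_plus_one, 0.8))
    print(alpha_brute_force("1**1", population, ones_plus_one, 0.8))
```

Both functions return the same value up to rounding, because an offspring matches H exactly when the first parent matches L(H, i) and the second parent matches R(H, i).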

After the publication of this exact GA schema theory, the 
question immediately arose whether it would be possible to extend pessimistic 
GP schema theories towards an exact GP schema theorem [LP02]. In fact, 
it was: Poli developed an exact GP schema theorem (see [Pol99a], [Pol00c], 
[Pol00b], [Pol00a], e.g.), a theorem which was then generalized by Poli and 
McPhee to become known as Poli and McPhee’s Exact GP Schema Theorem 
([PM01b], [PM01a], [PRM01], [PM01b], [Pol01], [PM03a], [PM03b], [PMR04], 
and [LP02]). 

Assuming equal size and shape for GP programs, (2.19) can also be used for 
describing the transmission probability of a fixed-size-and-shape GP schema. 
In the presence of one-point crossover, the transmission probability for a GP 
schema H at generation t, $\alpha(H,t)$, can thus be given as 

$$\alpha(H,t) = (1 - p_{xo})\, p(H,t) + \frac{p_{xo}}{N(H)} \sum_{i=1}^{N-1} p(l(H,i),t)\, p(u(H,i),t) \tag{2.20}$$

with l(H,i) and u(H,i) being the lower and upper parts (building blocks) 
of schema H, respectively, and N(H) the number of nodes in the schema 
(which is assumed to have the same size and shape as all other programs in 
the population). l(H,i) is defined as the schema produced by replacing all 
nodes above cutting point i with “don’t care” symbols, and u(H,i) as the 
schema produced by replacing all nodes below cutting point i with “don’t 
care” symbols. In analogy to (2.19), the summation in (2.20) runs over all 
possible crossover points. 

Exemplary l and u schemata for the schema H = +(*(=,x),=) are shown in 
Figure 2.19. 

In order to generalize this exact GP schema theorem so that it can be ap- 
plied to populations of programs of different sizes and shapes, a more general 
schema approach is used, namely the GP hyperschema concept. 

A GP hyperschema represents a set of schemata in the same way as a 
schema represents a set of program trees (which is why it is called 
“hyperschema”). This can be defined in the following way [PMR04]: Let F be the 
set of functions used, and T the set of terminals. Syntactically, a GP hyperschema 
is then defined as a tree composed of functions from the set F ∪ {=} and 
terminals from T ∪ {=, #}. The primitives = and # here mean “don’t care”; 
= stands for exactly one node, whereas # stands for any valid subtree. 

Examples are shown in Figure 2.20: Let F be defined as F = {+, -, *}, T as 
T = {x, y, z}, and the hyperschema H given as *(#, =(x, =)). The three 
exemplary programs *(y, *(x, x)), *(*(x, y), +(x, z)), and *(*(*(z, y), y), +(x, z)) 
are a part of H’s semantics. 
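
Whether a program belongs to the semantics of a hyperschema can be decided by a simple recursive comparison. The sketch assumes nested tuples in prefix form for programs and hyperschemata; for a hyperschema without # nodes the same function performs the matching of a fixed-size-and-shape schema. The example programs are chosen for this illustration.

```python
def matches(program, schema):
    """True if the program tree is an instance of the (hyper)schema tree."""
    if schema == "#":
        return True                      # '#' stands for any valid subtree
    program_is_function = isinstance(program, tuple)
    schema_is_function = isinstance(schema, tuple)
    if program_is_function != schema_is_function:
        return False                     # a leaf can only match a leaf
    if not schema_is_function:
        return schema == "=" or schema == program
    if len(program) != len(schema):
        return False                     # different number of arguments
    if schema[0] != "=" and schema[0] != program[0]:
        return False                     # function labels differ
    return all(matches(p, s) for p, s in zip(program[1:], schema[1:]))


H = ("*", "#", ("=", "x", "="))          # hyperschema *(#, =(x, =))

if __name__ == "__main__":
    programs = [
        ("*", "y", ("*", "x", "x")),
        ("*", ("*", "x", "y"), ("+", "x", "z")),
        ("*", ("*", ("*", "z", "y"), "y"), ("+", "x", "z")),
    ]
    print([matches(prog, H) for prog in programs])   # [True, True, True]
    print(matches(("*", "y", ("*", "y", "x")), H))   # False: first argument is not x
```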

In analogy to l(H, i) and u(H, i) defined above and sketched in Figure 2.19, 
the hyperschemata building blocks L(H, i) and U(H, i) are defined in the 
following way: L(H, i) is the hyperschema obtained by replacing all nodes on 
the path between crossover point i and the root of hyperschema H with = 
nodes, and all subtrees connected with those nodes with # nodes. U(H, i) 
is the hyperschema obtained by replacing the subtree below crossover point i 
with a # node [PMR04]. 

FIGURE 2.19: The GP schema H = +(*(=,x),=) and exemplary u and l 
schemata. Cross bars indicate crossover points; shaded regions show the parts 
of H that are replaced by “don’t care” symbols. 


FIGURE 2.20: The GP hyperschema *(#, =(x, =)) and three exemplary 
programs that are a part of the schema’s semantics. 


As examples might also help here to make this concept clearer, Figure 2.21 
shows an exemplary schema H = +(*(=,x),=) and potential hyperschema 
building blocks. As shown in the second column, L(H,1) is 
constructed by turning all nodes between crossover point 1 and the root (in 
this case only the root node) into = nodes, and all subtrees of the so modified 
nodes become # nodes. U(H,1), shown in column 3, is constructed by replacing the 
subtree under crossover point 1 with a # node. And finally, as can be seen 
in column 4, L(H,2) is again constructed by turning all nodes from crossover 
point 2 to the root into = nodes, and all subtrees of the so modified nodes 
become # nodes. 


FIGURE 2.21: The GP schema H = +(*(=,x),=) and exemplary U and L 
hyperschema building blocks (panels: H, L(H,1), U(H,1), L(H,2)). Cross bars 
indicate crossover points; shaded 
regions show the parts of H that are modified. 
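
The construction of the building blocks can be sketched as follows. Trees are nested tuples in prefix form, and a crossover point is identified by the path of child indices leading to the node directly below it; this addressing scheme, and the assumption that crossover point 1 lies above the * node and crossover point 2 above its first argument, are choices made for this illustration.

```python
def upper(schema, path):
    """U(H, i): the subtree below the crossover point is replaced by a '#' node."""
    if not path:
        return "#"
    i = path[0]
    return schema[:i] + (upper(schema[i], path[1:]),) + schema[i + 1:]


def lower(schema, path):
    """L(H, i): nodes between the crossover point and the root become '=' nodes,
    all other subtrees of these nodes become '#' nodes, and the subtree below
    the crossover point is kept."""
    if not path:
        return schema
    i = path[0]
    children = tuple(lower(child, path[1:]) if j == i else "#"
                     for j, child in enumerate(schema) if j > 0)
    return ("=",) + children


H = ("+", ("*", "=", "x"), "=")          # the schema +(*(=, x), =)

if __name__ == "__main__":
    print(lower(H, (1,)))      # ('=', ('*', '=', 'x'), '#')
    print(upper(H, (1,)))      # ('+', '#', '=')
    print(lower(H, (1, 1)))    # ('=', ('=', '=', '#'), '#')
```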


Using hyperschemata, it is possible to formulate a general, exact GP schema 
theorem for populations of programs of any size or shape. The total 
transmission probability of a fixed-size-and-shape GP schema H is, for GP with 
one-point crossover and no mutation, given as 

$$\alpha(H,t) = (1 - p_{xo})\, p(H,t) + p_{xo} \sum_{h_1} \sum_{h_2} \frac{p(h_1,t)\, p(h_2,t)}{NC(h_1,h_2)} \sum_{i \in C(h_1,h_2)} \delta(h_1 \in L(H,i))\, \delta(h_2 \in U(H,i)) \tag{2.21}$$

where $NC(h_1, h_2)$ is the number of nodes in the tree fragment representing the 
common region of the programs $h_1$ and $h_2$, $C(h_1, h_2)$ is the set of indices of 
the crossover points in the common region of $h_1$ and $h_2$, and $\delta(x)$ is a function 
that returns 1 if x is true and 0 otherwise. The first two summations run 
over all individuals in the population, i.e., we sum over all possible pairs of 
programs; the third summation runs over all indices of crossover points of 
the common region of the respective pair of programs. 

This GP schema theorem is called the “Microscopic Exact GP Schema 
Theorem” in the sense that it is necessary to consider each member of the 
population. 

Via several transformations and lemmata (which are not given here) it is 
finally possible to formulate the “Macroscopic Exact GP Schema Theorem”: 

$$\alpha(H,t) = (1 - p_{xo})\, p(H,t) + p_{xo} \sum_{j} \sum_{k} \frac{1}{NC(G_j,G_k)} \sum_{i \in C(G_j,G_k)} p(L(H,i) \cap G_j, t)\, p(U(H,i) \cap G_k, t) \tag{2.22}$$

where G(H) denotes the schema that is obtained by replacing all nodes in a 
schema H by “don’t care” symbols⁸, and the sums over j and k run over the 
possible program shapes $G_1, G_2, \ldots$; the sets $L(H,i) \cap G_j$ and $U(H,i) \cap G_k$ 
are either schemata (of fixed size and shape) or the empty set $\emptyset$. 

Thus, using this theorem (2.22), it is at last possible to give the exact 
transmission probability of a schema for genetic programming under one-point 
crossover and no mutation; an exact schema theorem for GP is established. 
We have here omitted many transformation steps and proofs; for these, the 
interested reader is for example referred to [PM03a], [PM03b], [LP02], or 
[PMR04]. 

An overview of the development of approximate and exact schema theorems 
for GAs and GP is graphically shown in Figure 2.22 (as given in [PMR04]). 


[Figure 2.22 (diagram). Columns: "GAs with One-Point Crossover" and "GP with 
One-Point Crossover". Boxes: Holland's GA Schema Theorem (1975) and Whitley's 
Version (1993); Poli and Langdon's GP Schema Theorem (1997); Poli's Exact GP 
Schema Theorem (2000); Stephens' GA Schema Theorem (2001); Poli and McPhee's 
Exact GP Schema Theorem (2001). Arrows are labeled "Refinement" and 
"Generalization".] 


FIGURE 2.22: Relation between approximate and exact schema theorems for 
different representations and different forms of crossover (in the absence of 
mutation). 

⁸ G(H) is called the hyperspace of H. 


2.5.4 Summary 


Until the development of the GP schema theorems described in this section, 
GP theory was typically considered scarce, approximate, and not terribly 
useful [PM01c]. The reasons usually given for this are that GP is relatively 
young and that building theories for variable-size structures is very 
complex. 

Significant breakthroughs, which have been summarized in this section, 
have fundamentally changed this understanding; after the development of GP 
schema theorems, we now have an exact GP theory based on schema and 
hyperschema concepts. 


2.6 Current GP Challenges and Research Areas 


Of course, theoretical work on GP was by no means finished after the 
development of GP schema theorems. Even though they shall not be discussed in 
detail here, we still want to outline a selection of current research areas in 
GP theory. 

For example, operator design for GP has been discussed in numerous 
publications; extensive analysis of initialization, crossover, and mutation operators 
can be found in [Lan99], [ES03], or [LN00], for example. 

The genetic programming search space has been subject to theoretical anal- 
ysis (see [LP98], [LP02], e.g.). Experimental exploration of the GP search 
space by random sampling can be used for comparing GP to random search 
or other search techniques. Additionally, hypotheses have been stated regard- 
ing minimum and maximum tree depth. 

As has already been mentioned before, a Markov model for GAs has been 
formulated by Vose, see [NV92], [VL91], and [Vos99] for explanations. In 
short, a GA is modeled as a Markov chain; selection, mutation, and crossover 
are incorporated into an explicitly given transition matrix, thus the method 
is complete, and no special assumptions are made which restrict populations 
or population trajectories. 

This GA Markov model could also be extended to GP using the GP schema 
theory described in the previous section, which gives exact formulas for 
computing the probability that reproduction and recombination create any 
specific program. A GP Markov chain model is then obtained by plugging 
this ingredient into a minor extension of Vose’s model of GAs [PMR04]; 
in fact, this theory provides an alternative approach for describing the 
dynamics of evolutionary algorithms. 

One fact has been known for genetic programming since some of its first 
applications and has been frequently reported: Programs in genetic 
programming populations tend to grow in size ([Ang94], [Lan95], [NB95], [SFD96], 
[AA05], [Ang98], [TH02]). “Redundancy,” “introns,” and “bloat” (probably the 
most frequently used term, and the one with the most negative connotation) 
have, amongst others, been used since then as names for this tendency. In 
principle, it means that introns, i.e., code which has no effect on the performance 
of the program containing it, grow during the GP process; it is in fact a 
phenomenon also known from natural evolution [WL96]. 

Of course, this seems to be an unwanted phenomenon and does not conform 
to “Occam’s Razor”, a law attributed to the 14th-century Franciscan friar 
William of Ockham. This law is also known as the “law of parsimony”; the 
Latin principle “entia non sunt multiplicanda praeter necessitatem”, meaning 
that “entities should not be multiplied beyond necessity”, is also often quoted. 
In principle, this law demands the selection of exactly that theory that 
postulates the fewest entities and introduces the fewest assumptions (provided, 
of course, that there are multiple competing theories which are considered 
equal in other respects). Arguments pointing out how and why GP does or does 
not fulfill Occam’s law can be found in [Dro98] and [LP97], for example.⁹ 

Examples for bloat are given in Figure 2.23: In the left example, the left 
subtree will always return (x − (0 * y + x)) = x − x = 0, and since the 
multiplication of 0 with any other value always results in 0, the result of the 
whole program will always be 0 regardless of the values of x, y, and z. In fact, 
the whole right subtree becomes code that does not influence the program’s 
evaluation. In the second example, shown in the right part of Figure 2.23, A 
will always be smaller than A + 4; thus the condition at the root node 
will always be fulfilled and the “else” branch will never be activated. 


FIGURE 2.23: Examples for bloat. 
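
The first example can be checked mechanically. The following is a minimal sketch only: the tuple-based tree encoding, the helper names, and in particular the shape of the right subtree are assumptions made for illustration (the figure itself is not reproduced here); only the left subtree x − (0 * y + x) is taken from the text.

```python
import operator

# Hypothetical encoding (an assumption of this sketch): a tree is a variable
# name, a numeric constant, or a tuple (operator_symbol, left, right).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(tree, env):
    """Recursively evaluate an expression tree for the variable values in env."""
    if isinstance(tree, tuple):
        symbol, left, right = tree
        return OPS[symbol](evaluate(left, env), evaluate(right, env))
    if isinstance(tree, str):
        return env[tree]
    return tree  # numeric constant


def size(tree):
    """Number of nodes in the tree."""
    if isinstance(tree, tuple):
        return 1 + size(tree[1]) + size(tree[2])
    return 1


# Left subtree from the text: x - (0 * y + x), which always evaluates to 0.
LEFT = ("-", "x", ("+", ("*", 0, "y"), "x"))
# Arbitrary stand-in for the right subtree (its real shape is not known here).
RIGHT = ("+", ("*", "y", "z"), ("-", "z", "x"))
# Root multiplication: the right subtree can never influence the result.
BLOATED = ("*", LEFT, RIGHT)
```

Whatever values are bound to x, y, and z, `evaluate(BLOATED, env)` yields 0, so under these assumptions the whole tree could be replaced by a single constant node.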


In contrast to the examples in Figure 2.23, in which bloat is rather obvious, 
there are of course also examples in which it can be seen that GP will not 
always automatically produce rather simple results. 

⁹ Especially “The Myth of Occam’s Razor” [W18], a paper written by Thorburn in 1918, is 
worth reading in this context as it discusses the origins of the principle. For more discussions 
on Occam’s razor and its reception in philosophy and science the interested reader is referred 
to [Jac94], [Nol97], [Pop92], or [RF99]. 

In their article entitled “Fitness Causes Bloat” [LP97], Langdon and Poli 
showed that fitness-based selection seems to be responsible for the solutions’ 
growth in size; fitness-based parent selection therefore leads to code bloat. 
In this context bloat has also been ironically described as “survival of the 
fattest.” 

According to [Zha97], [Zha00], and [LP02], approaches used for preventing 
or at least decreasing bloat include, but are not restricted to the following 
anti-bloat techniques: 


• Size and/or depth limitations: The growth of programs is limited; 
programs are not allowed to exceed a given size and/or depth (where the 
size of a program is normally the size of its structure tree and its depth 
the depth of its structure tree). Size limits are nowadays commonly 
used, see for example [KIAK99]. 


• Incorporation of program size in the selection process: Another 
frequently used technique to combat bloat is to include some preference for 
smaller programs in the criterion used to select programs for reproduction; 
this additional factor in selection is also called parsimony pressure. 
Examples and analysis can be found in [Kin93], [Zha97], [Zha00], 
[SF98], and [SH98], for example. 


• Incorporation of program size in evaluation: The size of a program 
could of course also be incorporated in its evaluation. It might also 
be included as one of the goals which the GP population tries to reach 
([LN00], [EN01]). 


• Genetic operators: Besides selection and evaluation, several crossover 
and mutation operators have been proposed which are designed so that 
they combat bloating, see for example [Ang98], [PL97b], or [Lan00]. 
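
The first two techniques in this list can be sketched in a few lines. This is an illustrative sketch only: the function names, the default limits, and the linear form of the penalty are assumptions and are not taken from the cited publications.

```python
def within_limits(size, depth, max_size=200, max_depth=17):
    """Size/depth limitation: accept a program only if it respects both limits.

    The default limits are illustrative values, not prescribed by the text.
    """
    return size <= max_size and depth <= max_depth


def parsimony_fitness(raw_fitness, size, pressure=0.01):
    """Parsimony pressure for a maximization problem.

    A penalty proportional to the program size is subtracted from the raw
    fitness; the linear form and the coefficient are assumptions.
    """
    return raw_fitness - pressure * size
```

With such a penalty, two programs of equal raw fitness are ranked by size, so the selection step prefers the smaller one; how strong the preference should be (here the hypothetical `pressure` coefficient) is a tuning decision.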


Often we see another tendency of GP that does not fulfill Occam’s law, 
namely that it is prone to producing programs that are overspecified. This 
means that programs are too complex for the problem at hand and that 
much simpler programs could fulfill the given task as well; especially in 
data-based modeling this phenomenon is also known as “overfitting.” We shall 
come back to this topic in Chapter 11. 

Another field of GP research is the development of practical guides for ideal 
parameter settings for GP. As we find in [SOG04], for example, GP researchers 
and practitioners are often frustrated by the lack of theory available to guide 
them in selecting key algorithm parameters; GP population sizes, for 
example, run from ten to a million members or more, but at present there is no 
practical guide to knowing when to choose which size. [SOG04] gives 
a population-sizing relationship depending on tree size, solution complexity, 
problem difficulty, and building block expression probability. 

Furthermore, numerous other theoretical topics are widely discussed in the 
GP community, many of them directly connected to well-known problems (or 
rather challenges) with GP. Selected ones will be mentioned in the next 
chapters. 

As a part of the conclusions of [LP02], Langdon and Poli suggest that GP 
users consider how their GP populations are evolving, whether 
they are converging, and, if so, whether they are converging in the right 
direction. At present, many GP packages offer only few possibilities to monitor 
populations. As we are going to demonstrate in later chapters, this is exactly 
what we try to accomplish by investigating dynamics in the populations of 
our GA and GP implementations. 


2.7 Conclusion 


In this chapter, genetic programming has been summarized and described 
as a powerful extension to the genetic algorithm. In fact, GP is more than a 
GA extension: It can be rather seen as the art of evolving computer programs 
and as a generic concept for the automated programming of computers. 

After describing GP basics and a variety of applications for GP, we 
have summarized theoretical concepts for GP based on schemata and 
hyperschemata. Problems and challenges in the context of GP have also been 
discussed. 

In the following chapters we shall now come back to algorithmic devel- 
opments in GAs. We will especially concentrate on enhanced algorithmic 
concepts which have been developed in order to support crossover-based evo- 
lutionary algorithms in their intention to combine those parts of chromosomes 
that define high quality solutions; these advanced concepts can of course also 
be used with GP. 

In Chapter 11 we then come back to GP and its application to data-based 
system identification; we also demonstrate the effects of these algorithmic 
enhancements in GP. 


2.8 Bibliographic Remarks 


There are numerous books, journals, and articles available that survey the 
field of genetic programming. In this section we summarize some of the most 
important ones. Representatively, the following books are widely considered 
very important sources of information about GP: 


• J. R. Koza et al.: Genetic Programming I - IV ([Koz92b], [Koz94], 
[KIAK99], [KKS+03b]): A series of books on theory and practice of 
genetic programming by John Koza and varying co-authors 
• W. Banzhaf et al.: Genetic Programming: An Introduction [BNKF98] 
• W. Langdon: Genetic Programming and Data Structures [Lan98] 
• W. Langdon and R. Poli: Foundations of Genetic Programming [LP02] 


The following journals are dedicated to either theory and applications of 
genetic programming or evolutionary computation in general: 


• Genetic Programming and Evolvable Machines (Springer Netherlands) 
• IEEE Transactions on Evolutionary Computation (IEEE) 
• Evolutionary Computation (MIT Press) 


Moreover, several conference and workshop proceedings include papers re- 
lated to genetic programming. Some examples are the following ones: 


• Genetic and Evolutionary Computation Conference (GECCO), a 
recombination of the International Conference on Genetic Algorithms and the 
Genetic Programming Conference 
• Congress on Evolutionary Computation (CEC) 
• Parallel Problem Solving from Nature (PPSN) 
• European Conference on Genetic Programming (EuroGP) 


Of course there is a lot of GP-related information available on the 
internet including theoretical background and practical applications, course 
slides, and source code. Probably the most comprehensive overview 
of publications in GP is The Genetic Programming Bibliography, which 
is maintained by Langdon, Gustafson, and Koza and available at 
http://www.cs.bham.ac.uk/~wbl/biblio/. 

Finally, publications of the Heuristic and Evolutionary Algorithms Labo- 
ratory (HEAL) (including several articles on GAs and GP) are available at 
http://www.heuristiclab.com/publications/. 


Chapter 3 


Problems and Success Factors 


3.1 What Makes GAs and GP Unique among Intelligent 
Optimization Methods? 


In contrast to trajectory-based heuristic optimization techniques such as 
simulated annealing or tabu search, and also in contrast to population-based 
heuristics which perform parallel local search as for example the conven- 
tional variants of evolution strategies (ES without recombination), genetic 
algorithms and genetic programming operate under fundamentally different 
assumptions. 


A neighborhood-based method usually scans the search space around a 
current solution in a predefined neighborhood in order to move in more 
promising directions, and is therefore often confronted with the problem of 
getting stuck in a local, but not global optimum of a multimodal solution 
space. 


What makes GAs and GP unique compared to neighborhood-based search 
techniques is the crossover procedure which is able to assemble properties of 
solution candidates which may be located in very different regions of the search 
space. In this sense, the ultimate goal of any GA or GP is to assemble and 
combine the essential genetic information (i.e., the alleles of a globally optimal 
or at least high quality solution) step by step. This information is initially 
scattered over many individuals and must be merged to single chromosomes 
by the final stage of the evolutionary search process. This perspective, which 
is under certain assumptions stated in the variants of the schema theory and 
the according building block hypothesis, should ideally hold for any GA or 
GP variant. This is exactly the essential property that has the potential to 
make GAs and GP much more robust against premature stagnation in local 
optimal solutions than search algorithms working without crossover. 




3.2 Stagnation and Premature Convergence 


The fundamental problem which many meta-heuristic optimization meth- 
ods aim to counteract with various algorithmic tricks is the stagnation in a 
locally, but not globally optimal solution. As stated previously, due to their 
methodology GAs and GP suffer much less from this problem. 

Unfortunately, users of evolutionary algorithms using crossover also 
frequently encounter a problem which, at least in its effect, is quite similar to 
the problem of stagnating in a local, but not global optimum. This 
drawback, in the terminology of GAs called premature convergence, occurs if the 
population of a GA reaches such a suboptimal state that the genetic 
solution manipulation operators (crossover and mutation) are no longer able to 
produce offspring that outperform their parents (as discussed for example in 
[Fog94], [Aff03]). In general, this happens mainly when the individuals of a 
population no longer contain the genetic information that would be 
necessary to further improve solution quality. 

Several methods have been proposed to combat premature convergence in 
genetic algorithms (see [LGX97], [Gao03], or [Gol89], e.g.). These include, 
for example, the restriction of the selection procedure, the operators, and the 
according probabilities as well as the modification of the fitness assignment. 
However, all these methods are heuristic by definition, and their effects vary 
with different problems and even problem instances. A critical problem in 
studying premature convergence therefore is the identification of its occur- 
rence and the characterization of its extent. Srinivas and Patnaik [SP94], 
for example, use the difference between the average and maximum fitness as 
a standard to measure genetic diversity, and adaptively vary crossover and 
mutation probabilities according to this measurement. 
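
The idea of steering operator probabilities by the gap between maximum and average fitness can be sketched as follows. This is a hedged illustration in the spirit of [SP94] for a maximization problem; the function name, the constants `k_high` and `k_low`, and the exact functional form are assumptions of this sketch rather than a reproduction of the published method.

```python
def adaptive_probability(f, f_max, f_avg, k_high=1.0, k_low=1.0):
    """Operator probability for a solution of fitness f (maximization).

    The spread f_max - f_avg serves as a rough diversity indicator: for a
    fixed distance of f to the best fitness, the returned probability grows
    as the spread shrinks, i.e., as the population converges.
    """
    spread = f_max - f_avg
    if f < f_avg or spread <= 0.0:
        # Below-average solutions (or a fully converged population) are
        # disrupted with the constant probability k_low.
        return k_low
    # Above-average solutions are protected: the best one gets probability 0.
    return k_high * (f_max - f) / spread
```

For example, a solution with fitness 9 in a population with maximum 10 receives a lower probability when the average is 5 than when the average has risen to 8, so disruption increases as the population becomes more homogeneous.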


Classical Measures for Diversity Maintenance 


The term “population diversity” has been used in many papers to study 
premature convergence (e.g., [SFP93], [YA94]) where the decrease of popula- 
tion diversity (i.e., a homogeneous population) is considered as the primary 
reason for premature convergence. The basic approaches for retarding prema- 
ture convergence discussed in GA literature aim to maintain genetic diversity. 
The most common techniques for this purpose are based upon pre-selection 
[Cav75], crowding [DeJ75], or fitness-sharing [Gol89]. The main idea of these 
techniques is to maintain genetic diversity by the preferred replacement of 
similar individuals [Cav75], [DeJ75] or by the fitness-sharing of individuals 
which are located in densely populated regions [Gol89]. While methods based 
upon those discussed in [DeJ75] or [Gol89] require some kind of neighbor- 
hood measure depending on the problem representation, the approach given 
in [Gol89] is additionally quite restricted to proportional selection. 
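
To make the dependence on a neighborhood measure concrete, the following sketch shows one common textbook form of fitness sharing. The triangular sharing function, the parameter names `sigma_share` and `alpha`, and the requirement that `distance(x, x)` is 0 are assumptions of this sketch; the distance function itself must be supplied for the problem representation at hand.

```python
def shared_fitness(individuals, fitnesses, distance, sigma_share=1.0, alpha=1.0):
    """Divide each raw fitness by the individual's niche count.

    The niche count sums a sharing value over all individuals that are
    closer than sigma_share; crowded regions therefore get reduced fitness.
    Assumes distance(x, x) == 0, so every niche count is at least 1.
    """
    shared = []
    for ind_i, f_i in zip(individuals, fitnesses):
        niche_count = 0.0
        for ind_j in individuals:
            d = distance(ind_i, ind_j)
            if d < sigma_share:
                niche_count += 1.0 - (d / sigma_share) ** alpha
        shared.append(f_i / niche_count)
    return shared
```

With real-valued individuals 0.0, 0.1, and 5.0 of equal raw fitness and the absolute difference as distance, the two neighbors share their fitness while the isolated individual keeps its raw value.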

Limitations of Diversity Maintenance 


In basic GA literature the topic of premature convergence is considered 
to be closely related to the loss of genetic variation in the entire 
population ([SFP93], [YA94]). In the opinion of the authors this perspective, which 
mainly stems from natural evolution, should be considered in more detail for 
the artificial evolutionary process as performed by a GA or GP. In 
ural evolution the maintenance of genetic diversity is of major importance as 
a rich gene pool enables a certain species to adapt to changing environmental 
conditions. In the case of artificial evolution, the environmental conditions, 
for which the chromosomes are to be optimized, are represented in the fitness 
function which usually remains unchanged during the run of an algorithm. 
Therefore, we do not identify the reasons for premature convergence in the 
loss of genetic variation in general but more specifically in the loss of what we 
call essential genetic information, i.e., in the loss of alleles which are part of 
a globally optimal solution. Even more specifically, whereas the alleles of high 
quality solutions are desired to remain in the gene pool of the evolutionary 
process, alleles of poor solutions are desired to disappear from the active gene 
pool in order to strengthen the goal-directedness of evolutionary search. 

Therefore, in the following we denote the genetic information of the globally 
optimal solution (which is unknown to the algorithm) as essential genetic 
information. If parts of this essential genetic information are missing or get 
lost, premature convergence is already predetermined in a certain way, as only 
mutation (or migration in the case of parallel GAs) is able to regain this 
genetic information. 

A very essential question about the general performance of a GA is whether 
or not good parents are able to produce children of comparable or even bet- 
ter fitness — after all, the building block hypothesis implicitly relies on this. 
Unfortunately, this property cannot be guaranteed easily for GA applications 
in general: The disillusioning fact here is that the user has to take care of an 
appropriate encoding in order to make this fundamental property hold. 

Reconsidering the basic functionality of a GA, the algorithm selects two 
above average parents for recombination and sometimes (with usually rather 
low probability) mutates the crossover result. The resulting chromosome is 
then considered as a member of the next generation and its alleles are therefore 
part of the gene pool for the ongoing evolutionary process. 

Reflecting the basic concepts of GAs, the following questions and associated 
problems arise: 


• Is crossover always able to fulfill the implicit assumption that two 
above-average parents can produce even better children? 


• Which of the available crossover operators is best suited for a certain 
problem in a certain representation? 


• Which of the resulting children are “good” recombinations of their 
parents’ chromosomes? 


• What makes a child a “good” recombination? 


• Which parts of the chromosomes of above-average parents are really 
worth being preserved? 


Conventional GAs are usually not able to answer these questions 
in a satisfactory way; ideally, a satisfactory answer should be possible for any 
GA application and not only for a canonical GA in the sense of the schema 
theorem and the building block hypothesis. These observations constitute the 
starting point for generic algorithmic enhancements as stated in the following 
chapters. The preservation of essential genetic information, widely independent 
of the actually applied representation and operators, plays a main role. These 
advanced evolutionary algorithm techniques, called offspring selection, relevant 
alleles preserving genetic algorithm (RAPGA), and SASEGASA, will be 
compared to a conventional GA by way of example in Chapter 7 and extensively 
analyzed in the experimental part of the book on the basis of various problems. 


Chapter 4 


Preservation of Relevant Building 
Blocks 


4.1 What Can Extended Selection Concepts Do to Avoid 
Premature Convergence? 


The ultimate goal of the extended algorithmic concepts described in this 
chapter is to support crossover-based evolutionary algorithms, i.e., evolution- 
ary algorithms that are ideally designed to function as building-block assem- 
bling machines, in their intention to combine those parts of the chromosomes 
that define high quality solutions. In this context we concentrate on selection 
and replacement which are the parts of the algorithm that are independent 
of the problem representation and the according operators. Thus, the 
application domain of the new algorithms is very wide; in fact, offspring selection 
and the RAPGA (a special variant of adaptive population sizing GA) can 
be applied to any application that can be treated by genetic algorithms (of 
course also including genetic programming). 


The unifying purpose of the enhanced selection and replacement strategies 
is to introduce selection after reproduction in a way that checks whether or 
not crossover and mutation were able to produce a new solution candidate 
that outperforms its own parents. Offspring selection realizes this by 
requiring that a certain ratio of the next generation (pre-defined by the user) 
consists of child solutions that were able to outperform their own parents 
(with respect to their fitness values). The RAPGA, the second newly 
introduced selection and replacement strategy, ideally works in such a way that 
new child solutions are added to the new population as long as it is possible 
to generate unique and successful offspring stemming from the gene pool of 
the last generation. Both strategies imply a self-adaptive regulation of the 
actual selection pressure that depends on how easy or difficult it is at present 
to achieve evolutionary progress. An upper limit for the selection pressure 
provides a good termination criterion for single population GAs as well as a 
trigger for migration in parallel GAs. 


4.2 Offspring Selection (OS) 


As already discussed at length, the first selection step chooses the parents 
for crossover either randomly or in any other well-known way, for example by 
roulette-wheel, linear-rank, or some kind of tournament selection strategy. 
After having performed crossover and mutation with the selected parents, 
we introduce a further selection mechanism that considers the success of the 
reproduction step just applied. In order to assure that the progression of 
genetic search occurs mainly with successful offspring, this is done in such 
a way that the crossover and mutation operators used have to create a 
sufficient number of children that surpass their parents’ fitness. Therefore, a 
new parameter called success ratio (SuccRatio ∈ [0, 1]) is introduced. The 
success ratio is defined as the proportion of the members of the next population 
that have to be generated by successful mating, relative to the total population 
size. Our adaptation of Rechenberg’s success rule ([Rec73], [Sch94]) for genetic 
algorithms says that a child is successful if its fitness is better than the fitness 
of its parents, whereby the meaning of “better” has to be explained in more 
detail: Is a child better than its parents if it surpasses the fitness of the 
weaker parent, the better parent, or some kind of weighted average of both? 

In order to answer this question, we have borrowed an aspect from 
simulated annealing: The threshold fitness value that has to be outperformed lies 
between the worse and the better parent, and the user is able to adjust a lower 
starting value and a higher end value denoted as comparison factor bounds; a 
comparison factor (CompFactor) of 0.0 means that we consider the fitness of 
the worse parent, whereas a comparison factor of 1.0 means that we consider 
the better of the two parents. During the run of the algorithm, the comparison 
factor is scaled between the lower and the upper bound, resulting in a broader 
search at the beginning and ending up with a more and more directed search 
at the end; this procedure in fact picks up a basic idea of simulated annealing. 
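
The threshold described above can be written down directly. In the sketch below, the function names are hypothetical, a maximization problem is assumed, and the linear schedule for the comparison factor is only one possible way of scaling it between the two bounds.

```python
def success_threshold(f_par1, f_par2, comp_factor):
    """Fitness value a child has to surpass (maximization).

    comp_factor = 0.0 yields the fitness of the worse parent,
    comp_factor = 1.0 the fitness of the better parent.
    """
    worse, better = min(f_par1, f_par2), max(f_par1, f_par2)
    return worse + comp_factor * (better - worse)


def scaled_comp_factor(iteration, n_iterations, lower=0.0, upper=1.0):
    """Scale the comparison factor from lower to upper over the run.

    The linear schedule is an assumption of this sketch.
    """
    if n_iterations <= 1:
        return upper
    return lower + (upper - lower) * iteration / (n_iterations - 1)
```

With parents of fitness 10 and 20, a comparison factor of 0.5 gives a threshold of 15: early in the run (factor near the lower bound) a child only has to beat the worse parent, late in the run it has to beat the better one.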

In the original formulation of the SASEGASA (which will be described in 
Chapter 5) we have defined that in the beginning of the evolutionary process 
an offspring only has to surpass the fitness value of the worse parent 
in order to be considered “successful”; as evolution proceeds, the fitness 
of an offspring has to be better than a fitness value continuously increasing 
between the fitness values of the weaker and the better parent. As in the case 
of simulated annealing, this strategy gives a broader search at the beginning, 
whereas at the end of the search process this operator acts in a more and 
more directed way. Having filled up the claimed ratio (SuccRatio) of the 
next generation with successful individuals using the success criterion defined 
above, the rest of the next generation ((1 − SuccRatio) · |POP|) is simply 
filled up with individuals randomly chosen from the pool of individuals that 
were also created by crossover, but did not reach the success criterion. The 
actual selection pressure ActSelPress at the end of generation i is defined by 
the quotient of individuals that had to be considered until the success ratio 
was reached and the number of individuals in the population in the following 
way: 

\[
ActSelPress = \frac{|POP_{i+1}| + |POOL_i|}{|POP_i|} \qquad (4.1)
\]

[Figure 4.1 (flowchart): POP_i (size |POP|) → selection (roulette, linear rank, 
tournament, ...) → crossover → mutation → child 'better' than parents? 
yes: insert into POP_{i+1}; no: insert into POOL (size |POOL|). The rest of the 
next population is filled up from POOL after enough 'better' children have 
been created.] 

FIGURE 4.1: Flowchart of the embedding of offspring selection into a genetic 
algorithm. This figure is displayed with kind permission of Springer Science 
and Business Media. 


Figure 4.1 shows the operating sequence of the concepts described above. 

An upper limit of selection pressure (MaxSelPress) defines the maximum 
number of offspring considered for the next generation (as a multiple of the 
actual population size) that may be produced in order to fulfill the success 
ratio. With a sufficiently high setting of MaxSelPress, this new model also 
functions as a detector for premature convergence: 

If it is no longer possible to find a sufficient number (SuccRatio · |POP|) 
of offspring outperforming their own parents even if (MaxSelPress · |POP|) 
candidates have been generated, premature convergence has occurred. 

As a basic principle of this selection model, higher success ratios cause 
higher selection pressures. Nevertheless, higher settings of success ratio, and 
therefore also higher selection pressures, do not necessarily cause premature 
convergence. The reason for this is mainly that the new selection step, by 
definition, does not accept clones that emanate from two identical parents. In 
conventional GAs such clones represent a major reason for premature 
convergence of the whole population around a suboptimal value, whereas the new 
offspring selection works against this phenomenon (see Chapters 7, 10, and 
11). 


With all strategies described above, finally a genetic algorithm with the ad- 
ditional offspring selection step can be formulated as stated in Algorithm 4.1. 
The algorithm is formulated for a maximization problem; in case of minimiza- 
tion problems the inequalities have to be changed accordingly. 


Algorithm 4.1 Definition of a genetic algorithm with offspring selection. 
Initialize total number of iterations nrOfIterations ∈ ℕ 
Initialize actual number of iterations i = 0 
Initialize size of population |POP| 
Initialize success ratio SuccRatio ∈ [0, 1] 
Initialize maximum selection pressure MaxSelPress ∈ ]1, ∞[ 
Initialize lower comparison factor bound LowerBound ∈ [0, 1] 
Initialize upper comparison factor bound UpperBound ∈ [LowerBound, 1] 
Initialize comparison factor CompFactor = LowerBound 
Initialize actual selection pressure ActSelPress = 1 
Produce an initial population POP_0 of size |POP| 

while (i < nrOfIterations) ∧ (ActSelPress < MaxSelPress) do 
  Initialize next population POP_{i+1} 
  Initialize pool for bad children POOL 
  while (|POP_{i+1}| < (|POP| · SuccRatio)) ∧ ((|POP_{i+1}| + |POOL|) < (|POP| · MaxSelPress)) do 
    Generate a child from the members of POP_i based on their fitness values 
      using crossover and mutation 
    Compare the fitness of the child c to the fitness of its parents par_1 and par_2 
      (without loss of generality assume that par_1 is fitter than par_2) 
    if f_c ≤ (f_par2 + |f_par1 − f_par2| · CompFactor) then 
      Insert child into POOL 
    else 
      Insert child into POP_{i+1} 
    end if 
  end while 
  ActSelPress = (|POP_{i+1}| + |POOL|) / |POP| 
  Fill up the rest of POP_{i+1} with members from POOL: 
  while |POP_{i+1}| < |POP| do 
    Insert a randomly chosen child from POOL into POP_{i+1} 
  end while 
  Adapt CompFactor according to the given strategy 
  i = i + 1 
end while 

For a detailed analysis of the consequences of offspring selection the reader is 
referred to Chapter 7 where the characteristics of a GA incorporating offspring 
selection will be compared to the characteristics of a conventional GA on the 
basis of a benchmark TSP. 
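
As a complement to Algorithm 4.1, the following Python sketch follows the same control flow for a maximization problem. It is an illustration under stated assumptions, not the HeuristicLab implementation: all function and parameter names are hypothetical, binary tournament is just one admissible parent selection scheme, the comparison factor is adapted linearly, and the fallback used when the pool is empty is an addition of this sketch. The OneMax toy problem in `demo` is likewise only chosen for brevity.

```python
import random


def offspring_selection_ga(fitness, new_individual, crossover, mutate,
                           pop_size=50, n_iterations=100, succ_ratio=0.7,
                           max_sel_press=50.0, lower=0.0, upper=1.0,
                           mutation_rate=0.05, rng=None):
    """GA with offspring selection for a maximization problem.

    Returns (final population, last actual selection pressure, iterations).
    """
    rng = rng or random.Random()
    pop = [new_individual(rng) for _ in range(pop_size)]
    comp_factor, act_sel_press, i = lower, 1.0, 0

    while i < n_iterations and act_sel_press < max_sel_press:
        scored = [(fitness(ind), ind) for ind in pop]
        next_pop, pool = [], []
        while (len(next_pop) < pop_size * succ_ratio
               and len(next_pop) + len(pool) < pop_size * max_sel_press):
            # Parent selection by binary tournament (an assumed choice).
            (f1, p1), (f2, p2) = [
                max(rng.sample(scored, 2), key=lambda s: s[0])
                for _ in range(2)
            ]
            child = crossover(p1, p2, rng)
            if rng.random() < mutation_rate:
                child = mutate(child, rng)
            # Threshold between the worse and the better parent.
            threshold = min(f1, f2) + abs(f1 - f2) * comp_factor
            if fitness(child) <= threshold:
                pool.append(child)       # unsuccessful child
            else:
                next_pop.append(child)   # successful child
        act_sel_press = (len(next_pop) + len(pool)) / pop_size

        # Fill up the rest with randomly chosen unsuccessful children; if the
        # pool is empty, fall back to the old population (an assumption).
        source = pool if pool else pop
        while len(next_pop) < pop_size:
            next_pop.append(rng.choice(source))

        pop = next_pop
        i += 1
        # Linear adaptation of the comparison factor (assumed strategy).
        comp_factor = lower + (upper - lower) * min(1.0, i / n_iterations)
    return pop, act_sel_press, i


def flip_random_bit(bits, rng):
    """Mutation for bit lists: flip one randomly chosen position."""
    k = rng.randrange(len(bits))
    return bits[:k] + [1 - bits[k]] + bits[k + 1:]


def demo(seed=1, n_bits=30):
    """Toy run on OneMax (maximize the number of ones); purely illustrative."""
    return offspring_selection_ga(
        fitness=sum,
        new_individual=lambda r: [r.randint(0, 1) for _ in range(n_bits)],
        crossover=lambda a, b, r: [r.choice(pair) for pair in zip(a, b)],
        mutate=flip_random_bit,
        rng=random.Random(seed),
    )
```

The outer loop ends either after the given number of iterations or as soon as the actual selection pressure reaches `max_sel_press`, which corresponds to the premature convergence detection described above. Because a clone of two identical parents can never exceed the threshold, it always ends up in the pool.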


4.3 The Relevant Alleles Preserving Genetic Algorithm 
(RAPGA) 


Assuming generational replacement as the underlying replacement strategy, 
the most essential question at generation i is which parts of genetic 
information from generation i should be maintained in generation i + 1 and how 
this could be done most effectively applying the available information 
(chromosomes and according fitness values) and the available genetic operators 
selection, crossover, and mutation. 

The variant of enhanced algorithmic concepts presented here, which is based 
upon GA solution manipulation operators, aims to achieve this goal by trying 
to bring out as much progress from the actual generation as possible while 
losing as little genetic diversity as possible at the same time. 

This idea is implemented using ad hoc population size adjustment in the 
sense that potential offspring generated by the basic genetic operators are 
accepted as members of the next generation if and only if they are able to 
outperform the fitness of their own parents and if they are new in the sense 
that their chromosome consists of a concrete allele alignment that is not rep- 
resented yet in an individual of the next generation. As long as new and 
(with respect to the definition given previously) “successful” individuals can 
be created from the gene pool of the actual generation, the population size 
is allowed to grow up to a maximum size. A potential offspring which is not 
able to fulfill these requirements is simply not considered for the gene pool of 
the next generation. 

Figure 4.2 represents the gene pool of the alleles at a certain generation i,
and Figure 4.3 illustrates how this genetic information can be used to
generate a next population i + 1 of a certain size, which may be smaller or
larger than that of the current population i. Whether the next population
becomes smaller or larger depends on how successful the genetic operators
crossover and mutation are in producing new and successful chromosomes.

FIGURE 4.2: Graphical representation of the gene pool available at a certain
generation. Each bar represents a chromosome with its alleles representing
the assignment of the genes at the respective loci.

FIGURE 4.3: The left part of the figure represents the gene pool at generation
i and the right part indicates the possible size of generation i + 1 which must
not go below a minimum size and also not exceed an upper limit. These
parameters have to be defined by the user.

For a generic, stable, and robust realization of these RAPGA ideas some
practical aspects have to be considered:

• The algorithm should offer the possibility to use different settings also
for conventional parent selection, so that the selection mechanisms for
the two parents do not necessarily have to be the same. In many examples
a combination of proportional (roulette wheel) selection and random
selection has already shown a lot of potential (for example in combination
with GP-based structure identification as discussed in [AWW08]). The two
different selection operators are called male and female selection. In the
context of the algorithmic concepts described here it is also possible and
reasonable to disable parent selection completely, as scalable selection
pressure comes along with the selection mechanisms after reproduction.
This can be achieved by setting both parent selection operators to random.


• Because reproduction results are only considered if they are successful
recombinations (and possibly mutations) of their parents’ chromosomes, it
becomes reasonable to use more than one crossover operator and more than
one mutation operator at the same time. Since only successful offspring
chromosomes are considered for the ongoing evolutionary process, even
crossover and mutation operators that mostly do not produce good results
can be applied, as long as they are still able to generate good offspring
at least sometimes. On the one hand the insertion of such operators
increases the average selection pressure and therefore also the average
running time; on the other hand these operators can help a lot to broaden
evolutionary search and therefore retard premature convergence. If more
than one crossover and mutation operator is allowed, the choice is made
purely at random, which has proven to produce better results than a
preference for more successful operators [Aff05].


• As indicated in Figure 4.3, a lower as well as an upper limit of population
size are still necessary in order to achieve efficient algorithmic
performance. Without an upper limit the population size would snowball,
especially in the first rounds, which is inefficient. A lower limit of at
least 2 individuals is also necessary: falling below it indicates that it
is no longer possible to produce a sufficient number of chromosomes that
are able to outperform their own parents, so the lower limit acts as a good
detector for convergence.


• Depending on the problem at hand there may be several possibilities to
fill up the next population with new individuals. If the problem
representation allows an efficient check for genotypical identity, it is
advisable to perform this check and to accept new chromosomes as members
of the next generation only if no structurally identical individual is
included in the population yet. If a check for genotypical identity is not
possible or too time-consuming, it is still possible to treat two
individuals as identical if they have the same fitness value, as an
approximate identity check. However, the user has to be aware that this
assumption may be too restrictive for fitness landscapes in which many
different individuals have identical fitness values; in such cases it is of
course advisable to check for genotypical identity.


• In order to terminate the run of a certain generation in case it is not
possible to fill up the maximally allowed population size with new
successful individuals, an upper limit of effort in terms of generated
individuals is necessary. This maximum effort per generation is the maximum
number of newly generated chromosomes per generation (no matter if these
have been accepted or not).


• The question whether or not an offspring is better than its parents is
answered in the same way as in the context of offspring selection.
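
The practical aspects listed above can be combined into a sketch of a single
RAPGA generation. The following Python function is an illustration under
stated assumptions rather than the authors' implementation: fitness is
maximized, make_child is a hypothetical hook returning a child together with
its two parents, and identity is a hypothetical parameter that maps an
individual to a hashable key for the identity check.

```python
def rapga_generation(pop, fitness, make_child, min_size, max_size,
                     max_effort, comp_factor, rng, identity=tuple):
    """Create the next RAPGA generation (maximization).

    make_child(pop, rng) is a hypothetical hook returning
    (child, parent_a, parent_b). identity maps an individual to a hashable
    key; passing the fitness function instead gives the approximate
    identity check based on equal fitness values.
    Returns (next_population, converged).
    """
    next_pop, seen, effort = [], set(), 0
    while len(next_pop) < max_size and effort < max_effort:
        child, parent_a, parent_b = make_child(pop, rng)
        effort += 1  # every generated child counts, accepted or not
        f_a, f_b = fitness(parent_a), fitness(parent_b)
        # Same success criterion as in offspring selection.
        threshold = min(f_a, f_b) + abs(f_a - f_b) * comp_factor
        key = identity(child)
        if fitness(child) >= threshold and key not in seen:
            seen.add(key)           # child is new in the next generation
            next_pop.append(child)  # and successful, so it is accepted
    # Falling below the lower limit is taken as the convergence signal.
    converged = len(next_pop) < min_size
    return next_pop, converged
```

The population size of the next generation is therefore not fixed in advance;
it results from how many new and successful children can be produced before
either the upper size limit or the maximum effort is reached.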


Figure 4.4 shows the typical development of the actual population size during
an exemplary run of RAPGA applied to the ch130 benchmark instance of the
traveling salesman problem taken from the TSPLib [Rei91]. More sophisticated
studies analyzing the characteristics of RAPGA will be presented in
Chapter 7.

FIGURE 4.4: Typical development of actual population size between the two
borders (lower and upper limit of population size) displaying also the identical
chromosomes that occur especially in the last iterations.


4.4 Consequences Arising out of Offspring Selection and 
RAPGA 


Typically, GAs operate under the implicit assumption that parent individ- 
uals of above average fitness are able to produce better solutions as stated 
in Holland’s schema theorem and the related building block hypothesis. This 
general assumption, which ideally holds under the restrictive assumptions of 
a canonical GA using binary encoding, is often hard to fulfill for many prac- 
tical GA applications as stated in the questions of Chapter 3 which shall be 
rephrased and answered here in the context of offspring selection and RAPGA: 


ad 1. Is crossover always able to fulfill the implicit assumption that two
above-average parents can produce even better children?

Unfortunately, the implicit assumption of the schema theorem, namely
that parents of above average fitness are able to produce even better
children, is not fulfilled by a lot of operators in many theoretical as
well as practical applications. This disillusioning fact has several
reasons: First, a lot of operators tend to produce offspring solution
candidates that do not meet the implicit or explicit constraints of certain
problem formulations. Commonly applied repair strategies included in
the operators themselves or applied afterwards have the consequence
that alleles of the resulting offspring are not present in the parents,
which directly counteracts the building block aspect. In many problem
representations it can easily happen that a lot of highly unfit child
solution candidates arise even from the same pair of above average parents
(think of GP crossover for example, where a lot of useless offspring
solutions may be developed, depending on the concrete choice of crossover
points). Furthermore, some operators have disruptive characteristics in the
sense that the evolution of longer building block sequences is not
supported. When using offspring selection (OS) or the RAPGA it is no longer
necessary that almost every reproduction trial is successful; only
successful offspring become members of the active gene pool for the ongoing
evolutionary process.


ad 2. Which of the available crossover operators is best suited for a
certain problem in a certain representation?


For many problem representations of certain applications a lot of
crossover concepts are available, and it is often not clear a priori which
of the possible operators is best suited. Furthermore, it is often also not
clear how the characteristics of operators change with the remaining
parameter settings of the algorithm or how the characteristics of the
individual operators change during the run of the algorithm. So it may
easily happen that certain, maybe more disruptive operators perform
quite well at the beginning of evolution whereas other crossover
strategies succeed rather in the final (convergence) phase of the algorithm.
In contrast to conventional GAs, for which usually one certain crossover
strategy has to be chosen in the beginning, the ability to use several
crossover and also mutation strategies in parallel is an important
characteristic of OS-based GAs and the RAPGA, as only the successful
reproduction results take part in the ongoing evolutionary process. It is
also an implicit feature of the extended algorithmic concepts that, when
several operators are used in parallel, only the results of those operators
succeed which are currently able to produce successful offspring, and this
set of operators changes over time. Even the usage of operator concepts that
are considered evidently weak for a certain application can be beneficial as
long as these operators are able to produce successful offspring from time
to time [Aff05].


ad 3. Which of the resulting children are “good” recombinations of their
parents’ chromosomes?


Both OS and RAPGA have been basically designed to answer this question
in a problem independent way. In order to retain generality, these
algorithms have to base the decision whether and to what extent a given
reproduction result is able to outperform its own parents on a comparison
of the offspring’s fitness with the fitness values of its own parent
chromosomes. By doing so, we claim that a resulting child is a good
recombination (which is a beneficial building block mixture) worth being
part of the active gene pool if the child chromosome has been able to
surpass the fitness of its own parents in some way.


ad 4. What makes a child a “good” recombination?


Whereas question 3 motivates how the decision whether or not a child is a
good recombination of its parent chromosomes may be carried out, question 4
intuitively asks why this makes sense. Generally speaking, OS and RAPGA
direct the selection focus to the phase after reproduction rather than
before reproduction. In our view this makes sense, as it is the result of
reproduction that will be part of the gene pool and that has to keep the
ongoing process alive. Even parts of chromosomes with below average fitness
may play an important role for the ongoing evolutionary process if they can
be combined beneficially with another parent chromosome; this motivates
gender specific parent selection [WA05b], as is for example applied in our
GP experiments shown in the practical part (Chapter 11) of this book. With
this gender specific selection aspect, which typically selects one parent
randomly and the other one according to some established selection strategy
(proportional, linear-rank, or tournament strategies), or even both parents
randomly, we decrease the selection pressure originating from parent
selection and balance this by increasing the selection pressure after
reproduction, which is adjusted self-adaptively depending on how easy or
difficult it is to achieve advancement.


ad 5. Which parts of the chromosomes of parents of above-average fitness are
really worth being preserved?


Ideally speaking, exactly those parts of the chromosomes of above-average
parents should be transferred to the next generation that make these
individuals above average. What may sound like a tautology at first glance
cannot be guaranteed for a lot of problem representations and
corresponding operators. In these situations, OS and RAPGA are able
to support the algorithm in this goal, which is essential for the building
block assembling machines GAs and GP.
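
The gender specific parent selection mentioned under question 4 can be
illustrated by a short Python sketch. It assumes proportional (roulette wheel)
selection for one parent and uniform random selection for the other; the
function name and the requirement of non-negative fitness values are
assumptions made for this illustration only.

```python
import random


def gender_specific_parents(pop, fitness_values, rng):
    """Pick one parent fitness-proportionally and the other uniformly.

    Assumes non-negative fitness values with a positive sum (maximization).
    """
    # Roulette wheel selection: probability proportional to fitness.
    proportional = rng.choices(pop, weights=fitness_values, k=1)[0]
    # Random selection: every individual is equally likely.
    uniform = rng.choice(pop)
    return proportional, uniform
```

Setting both choices to uniform selection corresponds to disabling parent
selection entirely, so that selection pressure arises only after reproduction.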


Chapter 5 


SASEGASA — More than the Sum of 
All Parts 


The concept of offspring selection as described in Chapter 4 is very well
suited to be transferred to parallel GA concepts. In terms of parallel
GA nomenclature, our proposed variant called SASEGASA (which stands for
self-adaptive segregative genetic algorithm with simulated annealing aspects)
is most closely related to the class of coarse-grained parallel GAs. The
well-known island model supports global search by taking advantage of the
steady pulsating interplay between breadth search and depth search supported
by the forces of genetic drift and migration.

SASEGASA acts differently by allowing the individual subpopulations to drift
a lot longer through the solution space, namely exactly until premature
convergence is detected in each of the subpopulations; this will be denoted
as local premature convergence. Then the algorithm aims very carefully to
bring together the essential genetic information that has evolved
individually in the separate demes.

Concretely, the following main distinguishing features can be pointed out 
comparing SASEGASA to a coarse-grained island model GA: 


Dynamic Migration Intervals 


Migration no longer happens at predefined fixed intervals but at those points
in time when local premature convergence is detected in the individual
subpopulations. The indicator of local premature convergence is that the
selection pressure, which can be measured in an offspring selection GA,
exceeds a certain limit. New genetic information is then added from adjacent
subpopulations that suffer from local premature convergence themselves, but
have evolved different alleles due to the undirected forces of genetic drift.
By this strategy the genetic search process can be initiated again until
local premature convergence is detected the next time.


From Islands to Growing Villages 


The most important difference between SASEGASA and the well-known island
model is that in SASEGASA the size of the subpopulations grows slowly as the
number of subpopulations is decreased. Therefore, the migration aspect of
SASEGASA can rather be associated with a village-town-city model than with
an island model. By this means the individual villages (rather small at the
beginning) can drift towards different regions of the search space until
they are all prematurely converged. Then the total number of subpopulations
is decreased by one and the individuals are regrouped as sketched in
Figure 5.1; then the new subpopulations evolve independently until local
premature convergence is detected again for each subpopulation. So the
initially rather small villages become larger and larger towns, finally
forming a large panmictic population. The main idea of this strategy is that
the essential alleles which may be spread over many different villages can
slowly and carefully be collected in a single population, resulting in a
high quality solution. As a consequence, parallelization is no longer as
efficient as in the island model, due to the changing number of
subpopulations. More sophisticated communication protocols between the
involved CPUs become necessary for efficient parallel implementations.
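
The regrouping step of the village-town-city model can be sketched as follows.
This Python function is only one possible reading of "joining adjacent
subpopulation members": it assumes that the villages are kept in a fixed order
and that the individuals are redistributed over one village fewer in
contiguous blocks of (almost) equal size. The function name is hypothetical.

```python
def reunify(villages):
    """Regroup k villages into k - 1 larger ones, keeping neighbors together."""
    new_count = len(villages) - 1
    if new_count < 1:
        raise ValueError("a single village cannot be reunified any further")
    # Flatten in village order so that adjacent members stay adjacent.
    members = [ind for village in villages for ind in village]
    base, extra = divmod(len(members), new_count)
    regrouped, start = [], 0
    for j in range(new_count):
        size = base + (1 if j < extra else 0)  # spread any remainder evenly
        regrouped.append(members[start:start + size])
        start += size
    return regrouped
```

With four villages of three individuals each, one reunification yields three
villages of four individuals, so every new village contains members of two
formerly separate neighbors.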


5.1 The Interplay of Distributed Search and Systematic 
Recovery of Essential Genetic Information 


When applying GAs to higher dimensional problems in combinatorial
optimization, it happens that genetic drift also causes alleles to become
fixed at suboptimal properties, which causes a loss of optimal properties in
the entire population. This effect is especially observable if attributes of
a global optimal solution with minor influence on the fitness function are
“hidden” in individuals with a bad total fitness. In that case parental
selection additionally promotes the dropout of those attributes.


This is exactly the point where considerations about multiple subpopula- 
tions, that systematically exchange information, come into play: 


Splitting the entire population into a certain number of subpopulations
(demes) causes the separately evolving subpopulations to explore different
genetic information due to the stochastic nature of genetic drift.
Especially in the case of multimodal problems the different subpopulations
tend to prematurely converge to different suboptimal solutions. The
idea is that the building blocks of a global optimal solution are scattered
over the single subpopulations, and the aim is to develop concepts to
systematically bring together these essential building blocks in one
population in order to make it possible to find a global optimal solution by
crossover. In contrast to the various coarse- and fine-grained parallel GAs
that have been discussed in the literature (good reviews are given in [AT99]
and [Alb05] for instance), we have decided to take a different approach by
letting the demes grow together step by step in case of local premature
convergence. Even if this property does not support parallelization as much
as established parallel GAs, we have decided to introduce this concept of
migration as it proved to support the localization of global optimal
solutions to a greater extent [Aff05]. Of course, the concept of
self-adaptive selection pressure steering is essential, especially
immediately after the migration phases, because these are exactly the phases
where the different building blocks of different subpopulations have to be
unified. Classical parallel GAs try to achieve this behavior by migration
alone, which is basically a good but not very efficient idea if no deeper
thought is given to selection pressure at the same time.

As first experiments have already shown, there should also be great
potential in equipping established parallel GAs with our newly developed
self-adaptive selection pressure steering mechanisms, which leads to more
stability in terms of operators and migration rates, automated detection of
migration intervals, etc.


5.2 Migration Revisited 


In nature the fragmentation of the population of a certain species into
several subpopulations of different sizes is a commonly observable
phenomenon. Many species are distributed over a large area comprising
various environments, which leads to the formation of subpopulations. An
important consequence of the population structure is the genetic
differentiation of subpopulations, i.e., the shift of allele frequencies in
the individual subpopulations. The reasons for genetic differentiation are:


• Local adjustment of different genotypes in different populations
• Genetic drift in the subpopulations
• Random differences in the allele frequency of individuals which build up
a new subpopulation


The structure of the population is hierarchically organized in different layers¹:

• Individual
• Subpopulation
• Local population
• Entire population (species)


¹ The concept of hierarchical population structures has been introduced by Wright [Wri43].




An important goal of population genetics is the detection of population
structures, the analysis of their consequences, and the location of the
layer with the most diverse allele frequencies. In this context a deeper
consideration of genetic drift and its consequences is of major interest.
The aspect of local adaptation of different genotypes in different
populations should give useful hints for multi-objective function
optimization or for optimization in changing environments.


One consequence of the population structure is the loss of heterozygosity
(genetic variation). The Swedish statistician and geneticist Wahlund [HC89]
described that genetic variation rises again if the structure is broken up
and mating becomes possible in the entire population. The SEGA and
SASEGASA algorithms, which will be described later, systematically take
advantage of this effect step by step.
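
The effect described by Wahlund can be made concrete with a small numerical
sketch. It relies on simplifying assumptions that are not part of the text
above: a single locus with two alleles, random mating within each unit (so
that the expected heterozygosity is 2p(1 − p) for allele frequency p), and
equally sized subpopulations. The function names are hypothetical.

```python
def heterozygosity(p):
    """Expected heterozygosity 2p(1 - p) at a biallelic locus."""
    return 2.0 * p * (1.0 - p)


def wahlund_gap(freqs):
    """Heterozygosity gained when equally sized subpopulations are merged.

    freqs holds the frequency of one allele in each subpopulation.
    """
    mean_p = sum(freqs) / len(freqs)
    h_total = heterozygosity(mean_p)  # after merging into one population
    h_sub = sum(heterozygosity(p) for p in freqs) / len(freqs)  # while split
    return h_total - h_sub
```

For two subpopulations with allele frequencies 0.2 and 0.8, the average
heterozygosity within the subpopulations is 0.32, whereas the merged
population reaches 0.5; under these assumptions the gap of 0.18 equals twice
the variance of the allele frequencies across the subpopulations.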


When talking about migration it is essential to consider some distinction 
concerning the genetic connection between the subpopulations which mainly 
depends on the gene flow (the exchange of alleles between subpopulations). 
Migration, the exchange of individuals, causes gene flow if and only if the 
exchanged individuals produce offspring. The most important effect of migra- 
tion and gene flow is the introduction of new alleles into the subpopulations. 
In that sense migration has effects similar to mutation, but can occur at much 
higher rates. If the gene flow between subpopulations is high, they become ge- 
netically homogeneous; in the case of little gene flow, the subpopulations may 
diverge due to selection, mutation, and drift. Population genetics provides 
a set of models for the theoretical analysis of gene flows. The most popular
migration models of population genetics are the mainland-island model,
which considers migration in just one direction, and the island model, which
allows migration in both directions. As discussed in parallel GA theory, the
migration rate is an essential parameter for the description of migration. In
population genetics the migration rate describes the ratio of chromosomes
migrating among subpopulations.


5.3 SASEGASA: A Novel and Self-Adaptive Parallel 
Genetic Algorithm 


We have already proposed several new EA-variants. The first prototype of
this new class of evolutionary search, which considers the concept of
controllable selection pressure ([Aff01c], [Aff02]) for information exchange
between independently evolving subpopulations, has been introduced with the
Segregative Genetic Algorithm (SEGA) [Aff01a], [Aff01b]. Even if the SEGA is
already able to produce very high quality results in terms of global solution
quality, selection pressure has to be set by the user, which is a very
time-consuming and difficult challenge. Further research, which aimed to
introduce self-adaptive selection principles for the steering of selection
pressure ([AW03], [AW04a]), resulted in the so-called SASEGASA-algorithm
([AW03], [Aff05]), which already represents a very stable and efficient
method for producing high quality results without introducing
problem-specific knowledge or local search.

So far parallelism has “only” been introduced for improving global solution 
quality and all experiments have been performed on single-processor machines. 
Nevertheless, there is nothing to be said against transforming the concepts 
evolved in the parallel GA community to also improve the quantitative perfor- 
mance of our new methods. Empirical studies have shown that the theoretical 
concepts of the SASEGASA-algorithm [AW03] allow global solution quality 
to be steered up to the highest quality regions by just increasing the number 
of demes involved. The algorithm turned out to find the global optimum for 
all considered TSP benchmarks as well as for all considered benchmark test 
functions up to very high problem dimensions [Aff05]. 

Therefore an enormous increase of efficiency can be expected when applying
concepts of supercomputing, allowing us to also attack much higher
dimensional theoretical and practical problems efficiently in a parallel
environment. Because all newly proposed theoretical concepts are problem
independent, there is no restriction to a certain class of problems; all
problems for which GA theory (and also GP theory) offers adequate operators
can be attacked.


5.3.1 The Core Algorithm 


In principle, the SASEGASA introduces two enhancements to the basic con- 
cept of genetic algorithms. Firstly, the algorithm makes use of variable selec- 
tion pressure, as introduced as offspring selection (OS) in Chapter 4, in order 
to self-adaptively control the goal-orientedness of genetic search. The second 
concept introduces a separation of the population to increase the broadness of 
the search process so that the subpopulations are joined after local premature 
convergence has occurred. This is done in order to end up with a population 
including all genetic information sufficient for locating a global optimum. 

The aim of dividing the whole population into a certain number of
subpopulations (segregation) that grow together in case of stagnating fitness
within those subpopulations (reunification) is to combat premature
convergence, which is the source of GA-difficulties. The basic properties (in
terms of solution quality) of this segregation and reunification approach
have already proven their potential in overcoming premature convergence
[Aff01a], [Aff01b] in the so-called SEGA algorithm.

By using this approach of breadth search, essential building blocks can
evolve independently in different regions of the search space. In the case
of standard GAs those relevant building blocks are likely to disappear early
on due to genetic drift and, therefore, their genetic information cannot be
provided at a later phase of evolution, when the search for a global optimum
is of paramount importance.

However, within the SEGA algorithm there is no criterion to detect premature
convergence, and there is also no self-adaptive selection pressure steering
mechanism. Even if the results of SEGA are quite good with regard to global
convergence [Aff01b], it requires an experienced user to adjust the selection
pressure steering parameters, and as there is no criterion to detect premature
convergence the dates of reunification have to be implemented statically.

FIGURE 5.1: Flowchart of the reunification of subpopulations of a
SASEGASA (light shaded subpopulations are still evolving, whereas dark
shaded ones have already converged prematurely). This figure is displayed
with kind permission of Springer Science and Business Media.


Equipped with offspring selection we have both: a self-adaptive selection
pressure (depending on the given success ratio) as well as an automated
detection of local premature convergence, which is indicated when the current
selection pressure becomes higher than the given maximal selection pressure
parameter (MaxSelPress). Therefore, a date of reunification is set when local
premature convergence has occurred within all subpopulations, in order to
increase genetic diversity again. Figure 5.1 shows a schematic diagram of the
migration policy in the case of a reunification phase of the SASEGASA
algorithm. The dark shaded subpopulations stand for already prematurely
converged subpopulations. If all subpopulations are prematurely converged
(dark shaded) a new reunification phase is initiated.


FIGURE 5.2: Quality progress of a typical run of the SASEGASA algorithm.
This figure is displayed with kind permission of Springer Science and Business
Media.


Figures 5.2 and 5.3 show the typical shape of the fitness curves and the
selection pressure progress of a SASEGASA test run. In this example the
number of subpopulations is set to 10. The vertical lines indicate dates of
reunification. In the quality diagram (Figure 5.2) the lines give the fitness
value of the best member of each deme; the best known solution is represented
by the horizontal line. In the selection pressure diagram (shown in
Figure 5.3) the lines stand for the actual selection pressure in the
individual demes, measured as the quotient of evaluated solution candidates
per round (in a deme) to the subpopulation size. The lower horizontal line
represents a selection pressure of 1 and the upper horizontal line represents
the maximum selection pressure. If the actual selection pressure of a certain
deme exceeds the maximum selection pressure, local premature convergence is
detected in this subpopulation and evolution is stopped in this deme (which
can be seen in the constant value of the corresponding fitness curve) until
the next reunification phase is started (if all demes are prematurely
converged).


FIGURE 5.3: Selection pressure curves for a typical run of the SASEGASA
algorithm. This figure is displayed with kind permission of Springer Science
and Business Media.


With all the above described strategies, the complete SASEGASA algorithm
can be stated as described in Figure 5.4.

Again, similarly to the context of offspring selection, it should be pointed
out that the corresponding standard genetic algorithm is fully contained in
SASEGASA as a special case, namely when the number of subpopulations
(villages) is set to 1 and the success ratio is set to 0 at the beginning of
the evolutionary process. Moreover, the introduced techniques do not use any
problem-specific information.


5.4 Interactions among Genetic Drift, Migration, and 
Self-Adaptive Selection Pressure 


Using all the introduced generic algorithmic concepts combined in 
SASEGASA, it becomes possible to utilize the interactions between genetic 
drift and the SASEGASA specific dynamic migration policy in a very advan- 
tageous way in terms of achievable global solution quality: 

Initially a certain number of subpopulations evolve absolutely indepen- 
dently from each other until no further evolutionary improvement is possible 
when using a genetic algorithm with offspring selection in the subpopulations, 
i.e., until local premature convergence is detected in all subpopulations. 

Initialize current number of villages noOfVillages
Initialize size of population |POP|
Initialize current size of subpopulations |subPOP| = |POP| / noOfVillages
Initialize upper bound for number of generations between recombination
phases maxGenerations

Generate subpopulations subPOP_1, ..., subPOP_{|noOfVillages|}
Initialize all subpopulations subPOP_1, ..., subPOP_{|noOfVillages|}

Initialize total number of generations totalGenerations = 0
Initialize number of generations since last recombination phase
currGenerations = 0

Are all subpopulations subPOP_1, ..., subPOP_{|noOfVillages|} converged, or
is currGenerations = maxGenerations?

    no:  Generate next generation for each subpopulation
         subPOP_1, ..., subPOP_{|noOfVillages|} (as described in Fig. 4.1)
         totalGenerations = totalGenerations + 1
         currGenerations = currGenerations + 1

    yes: noOfVillages = noOfVillages - 1
         |subPOP| = |POP| / noOfVillages
         currGenerations = 0
         Reset all subpopulations subPOP_1, ..., subPOP_{|noOfVillages|} by
         joining an appropriate number of adjacent subpopulation members

FIGURE 5.4: Flowchart showing the main steps of the SASEGASA. This
figure is displayed with kind permission of Springer Science and Business
Media.


As a matter of principle, it is also the case that primarily those alleles
are fixed in the individual subpopulations which currently influence the
fitness function to a greater extent. The rest of the essential alleles with
a currently low influence on the fitness value, which may be required for a
global optimal solution later, are unlikely to be combined in any single
subpopulation, because basically these alleles are distributed over all
subpopulations.

Since all subpopulations have prematurely converged to solutions with
comparable fitness values after this first optimization stage, genetic
information that has not been considered before is suddenly taken into account
by selection after a reunification phase. We thus establish a step-by-step
reunification of independently evolving subpopulations, which is triggered by
the detection of convergence; in this way it becomes possible to
systematically consider the essential alleles at exactly those points in time
when they are in a position to become accepted in the newly emerging larger
subpopulations. By means of this procedure, smaller building blocks² of a
globally optimal solution are enriched step by step with new alleles and
evolve into larger and larger building blocks, ending up in one single
subpopulation that contains all genetic information required for a globally
optimal solution.
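
The reunification scheme described above can be sketched roughly as follows.
This is a minimal illustration under stated assumptions, not the HeuristicLab
implementation: the callables evolve_one_generation and is_converged are
hypothetical placeholders supplied by the caller, the loop is assumed to end
once the last remaining population has finished its phase, and adjacent
subpopulations are joined by simply re-splitting the concatenated population
into one village fewer.

```python
from typing import List, Sequence, TypeVar

Individual = TypeVar("Individual")


def split_into_villages(population: Sequence[Individual],
                        n_villages: int) -> List[List[Individual]]:
    """Split a population into n_villages contiguous, nearly equal-sized parts."""
    size, rest = divmod(len(population), n_villages)
    villages, start = [], 0
    for v in range(n_villages):
        end = start + size + (1 if v < rest else 0)
        villages.append(list(population[start:end]))
        start = end
    return villages


def sasegasa_skeleton(population, n_villages, max_generations,
                      evolve_one_generation, is_converged):
    """Run the reunification loop; returns (final population, total generations).

    evolve_one_generation(village) -> new village   (hypothetical helper)
    is_converged(village) -> bool                   (hypothetical helper)
    """
    villages = split_into_villages(population, n_villages)
    total_generations = 0
    curr_generations = 0
    while True:
        phase_over = (all(is_converged(v) for v in villages)
                      or curr_generations == max_generations)
        if not phase_over:
            # "no" branch: every village evolves independently for one generation.
            villages = [evolve_one_generation(v) for v in villages]
            total_generations += 1
            curr_generations += 1
            continue
        if n_villages == 1:
            break  # assumed termination rule: the single population is done
        # "yes" branch: one village fewer, members of adjacent villages are joined.
        n_villages -= 1
        merged = [ind for village in villages for ind in village]
        villages = split_into_villages(merged, n_villages)
        curr_generations = 0
    return villages[0], total_generations
```

With 4 villages and maxGenerations = 3, a run in which no village ever
converges executes 3 generations in each of the phases with 4, 3, 2, and 1
villages.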


With an increasing number of equally sized subpopulations, the probability 
that essential alleles are barred from dying off increases due to the greater 
total population size; this results in a higher survival probability of the alleles 
which are not yet considered. 


As empirically demonstrated in Chapters 7, 10, and 11, the procedure de- 
scribed above makes it possible to find high quality solutions for more and 
more difficult problems by simply increasing the number of subpopulations. 
Of course this causes increasing computational costs; still, this growth in com- 
putational costs is not exponential, but linear. 


² The notion of building blocks is considered in a more general interpretation
than in Holland’s definition.


Chapter 6 


Analysis of Population Dynamics 


There are several aspects of dynamics in populations of genetic algorithms
that can be observed and analyzed. In this chapter we describe those aspects
on which we have concentrated and which will also be analyzed for evaluating
different algorithmic GA settings on various problem instances:


• In Section 6.1 we describe how we analyze which individuals of the
  population succeed in passing their genetic information on to the next
  generation.

• In Section 6.2 we give a summary of approaches for analyzing the diversity
  among populations of GAs using some kind of similarity measure for solution
  candidates. We use these concepts to measure how diverse the individuals of
  populations are as well as how similar the populations of multi-population
  GAs become during runtime.


Furthermore, in Chapter 10 we will analyze the dynamics of population di- 
versity over time for the combinatorial optimization problems TSP (see Sec- 
tion 10.1) and CVRP (see Section 10.2) on the basis of GA variants considered 
in this book. 


6.1 Parent Analysis 


In the context of conventional GAs, parent selection is normally responsible 
for selecting fitter individuals more often than those that are less fit. Thus, 
fitter individuals are supposed to pass on their genetic material to more mem- 
bers of the next generation. 

When using offspring selection, several additional aspects have to be con- 
sidered. As only those children survive this selection step that perform better 
than their parents to a certain degree, we cannot guarantee that fitter parents 
succeed more often than less fit ones. 

This is why we document the parent indices of all successful offspring for
each generation step. We can thus analyze whether all parts of the population
are considered for effective propagation of their genetic information, or
whether only better ones or rather bad ones are successful.

Formally, in parent analysis we analyze the genetic propagation of parents
P to their children C by calculating the propagation count pc for each parent
as the number of successful children it was able to produce by being crossed
with other parents or by mutation:

\[
\mathrm{isParent}(p, c) =
\begin{cases}
1 & : p \in c.\mathrm{Parents} \\
0 & : \text{otherwise}
\end{cases}
\tag{6.1}
\]
\[
\forall (p \in P) : pc(p) = \sum_{c \in C} \mathrm{isParent}(p, c)
\tag{6.2}
\]


In addition, we can optionally weight the propagation count of each potential
parent with the similarity of the parent and its children (supposing the
availability of a similarity function sim which can be used for calculating
the similarity of solution candidates):

\[
\forall (p \in P) : pc'(p) = \sum_{c \in C} \mathrm{isParent}(p, c) \cdot \mathrm{sim}(p, c)
\tag{6.3}
\]


This kind of population dynamics analysis shall be used later in Section 11.3, 
where we will see how enhanced selection models affect genetic propagation 
in data-based modeling using genetic programming. 
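
The propagation counts of Formulas 6.2 and 6.3 can be computed along the
following lines. This is only a sketch: the data layout (parents stored by
index, each successful child given together with the indices of its parents)
and the function name propagation_counts are assumptions made for this
illustration.

```python
from typing import Callable, Dict, Hashable, Iterable, Optional, Tuple


def propagation_counts(parents: Dict[Hashable, object],
                       children: Iterable[Tuple[object, Tuple[Hashable, ...]]],
                       sim: Optional[Callable[[object, object], float]] = None
                       ) -> Dict[Hashable, float]:
    """Return pc(p) for every parent, or pc'(p) if a similarity function is given.

    parents:  mapping parent index -> parent solution
    children: (child solution, indices of its parents) for each successful child
    """
    counts = {index: 0.0 for index in parents}
    for child, parent_indices in children:
        # isParent(p, c) is either 0 or 1, so each parent counts once per child.
        for index in set(parent_indices):
            weight = 1.0 if sim is None else sim(parents[index], child)
            counts[index] += weight
    return counts
```

A parent that never appears among the parents of a successful child keeps a
propagation count of 0, which makes unsuccessful parts of the population
directly visible.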


6.2 Genetic Diversity 


In this section we describe the measures which we use to monitor the diver- 
sity and population dynamics with respect to the genetic make-up of solution 
candidates using some kind of similarity measurement function that estimates 
the mutual similarity of solution candidates. 

As we know that similarity measures do not have to be symmetric (see 
Section 9.4 for examples and explanations), we can alternatively use the mean 
value of the two possible similarity calls and so define a symmetric similarity 
measurement: 


\[
\mathrm{symmetricAnalysis} \Rightarrow
\mathrm{sim}(m_1, m_2) = \frac{\mathrm{sim}(m_1, m_2) + \mathrm{sim}(m_2, m_1)}{2}
\tag{6.4}
\]
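
As a small illustration of Formula 6.4, a possibly asymmetric similarity
function can be wrapped so that both call directions are averaged. The
wrapper name symmetric is an assumption made for this sketch.

```python
from typing import Callable, TypeVar

Solution = TypeVar("Solution")


def symmetric(sim: Callable[[Solution, Solution], float]
              ) -> Callable[[Solution, Solution], float]:
    """Wrap sim so that the mean of sim(m1, m2) and sim(m2, m1) is returned."""
    def symmetric_sim(m1: Solution, m2: Solution) -> float:
        return (sim(m1, m2) + sim(m2, m1)) / 2
    return symmetric_sim
```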


6.2.1 In Single-Population GAs 


In the context of single-population GAs we are mainly interested in the 
similarity among the individuals of the population: For each solution s of 
the population P we calculate the mean and the maximum similarity with all
other individuals in the population:

\[
\mathrm{meanSim}(s, P) = \frac{1}{|P| - 1} \sum_{s_2 \in P,\, s_2 \neq s} \mathrm{sim}(s, s_2)
\tag{6.5}
\]
\[
\mathrm{maxSim}(s, P) = \max_{s_2 \in P,\, s_2 \neq s} \left( \mathrm{sim}(s, s_2) \right)
\tag{6.6}
\]


The mean values of all individuals’ similarity values are used for calculating 
the average mean and average maximum similarity measures for populations: 


\[
\mathrm{meanSim}(P) = \frac{1}{|P|} \sum_{s \in P} \mathrm{meanSim}(s, P)
\tag{6.7}
\]
\[
\mathrm{maxSim}(P) = \frac{1}{|P|} \sum_{s \in P} \mathrm{maxSim}(s, P)
\tag{6.8}
\]
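
The single-population measures of Formulas 6.5 to 6.8 might be implemented as
sketched below. The function names are hypothetical, and "all other
individuals" is interpreted by position in the population (so that duplicates
of a solution still count as other individuals), which is an assumption of
this sketch; the population must contain at least two solutions.

```python
from typing import Callable, Sequence, TypeVar

Solution = TypeVar("Solution")
Sim = Callable[[Solution, Solution], float]


def mean_sim(index: int, population: Sequence[Solution], sim: Sim) -> float:
    """Mean similarity of population[index] with all other individuals (6.5)."""
    s = population[index]
    values = [sim(s, s2) for j, s2 in enumerate(population) if j != index]
    return sum(values) / len(values)


def max_sim(index: int, population: Sequence[Solution], sim: Sim) -> float:
    """Maximum similarity of population[index] with all other individuals (6.6)."""
    s = population[index]
    return max(sim(s, s2) for j, s2 in enumerate(population) if j != index)


def population_mean_sim(population: Sequence[Solution], sim: Sim) -> float:
    """Average of the mean similarities of all individuals (6.7)."""
    n = len(population)
    return sum(mean_sim(i, population, sim) for i in range(n)) / n


def population_max_sim(population: Sequence[Solution], sim: Sim) -> float:
    """Average of the maximum similarities of all individuals (6.8)."""
    n = len(population)
    return sum(max_sim(i, population, sim) for i in range(n)) / n
```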


6.2.2 In Multi-Population GAs 


In the context of parallel evolution of populations in genetic algorithms, 
which is summarized in Section 1.7, we can apply the population diversity 
analysis for each population separately; in the following we will describe a 
multi-population specific diversity analysis. 

Basically, a solution s is compared to all solutions in another population P’ 
which does not include s, and multiPopSim(s, P’) is equal to the maximum 
of the so calculated similarity values: 


\[
s \notin P' \Rightarrow
\mathrm{multiPopSim}(s, P') = \max_{s_2 \in P'} \left( \mathrm{sim}(s, s_2) \right)
\tag{6.9}
\]

So we can calculate the multi-population similarity of a solution with respect
to a set of populations PP as the average multiPopSim of the solution to all
populations except its “own” one:

\[
(s \in P \wedge P \in PP) \Rightarrow PP' = \{ P' : P' \in PP \wedge P' \neq P \},
\tag{6.10}
\]
\[
\mathrm{multiPopSim}(s, PP) = \frac{1}{|PP'|} \sum_{P' \in PP'} \mathrm{multiPopSim}(s, P')
\tag{6.11}
\]

Finally, a population’s multiPopSim value is equal to the average of all its
solutions’ multi-population similarity values with respect to the whole set of
populations:

\[
\mathrm{multiPopSim}(P, PP) = \frac{1}{|P|} \sum_{s \in P} \mathrm{multiPopSim}(s, PP)
\tag{6.12}
\]
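
Formulas 6.9 to 6.12 translate into three small functions. The following is a
sketch under assumptions: populations are held in a list and identified by
their list index, and all function names are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

Solution = TypeVar("Solution")
Sim = Callable[[Solution, Solution], float]


def multi_pop_sim_to_population(s: Solution, other: Sequence[Solution],
                                sim: Sim) -> float:
    """Maximum similarity of s with the solutions of another population (6.9)."""
    return max(sim(s, s2) for s2 in other)


def multi_pop_sim_of_solution(s: Solution, own_index: int,
                              populations: Sequence[Sequence[Solution]],
                              sim: Sim) -> float:
    """Average of (6.9) over all populations except the own one (6.10, 6.11)."""
    others = [p for k, p in enumerate(populations) if k != own_index]
    return sum(multi_pop_sim_to_population(s, p, sim) for p in others) / len(others)


def multi_pop_sim_of_population(own_index: int,
                                populations: Sequence[Sequence[Solution]],
                                sim: Sim) -> float:
    """Average multi-population similarity of all solutions of a population (6.12)."""
    own = populations[own_index]
    return sum(multi_pop_sim_of_solution(s, own_index, populations, sim)
               for s in own) / len(own)
```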


6.2.3 Application Examples


As we are aware that these formulas for calculating the genetic diversity 
in populations might seem a bit confusing, we shall here give application 
examples and some graphical illustrations of the results so that the main 
ideas become a bit clearer. 

For this purpose we have decided to let a genetic algorithm search for an 
optimal solution for a specific instance of the traveling salesman problem, 
namely the 130 cities problem ch130 taken from the TSPLIB [Rei91]. 

The first algorithm tested was a conventional GA with 100 solution can- 
didates, order crossover, and 5% mutation rate; it was executed over 10,000 
generations, and the similarity for each pair of solutions in the population was 
measured every 10 iterations. 

We hereby calculate the similarity of TSP solutions in the following way: 
As each solution is given as a path, we can find out which city is visited 
after another one for each step in the tour. These edges of the cities graphs 
are considered for calculating the similarity of tours: The proportion of edges 
that are common in both tour graphs represents the similarity of TSP solution 
candidates. More formally, we can use the following edge definition¹:

\[
\mathrm{isedge}([i, j], t) =
\exists k : (t[k] = i \wedge t[k+1] = j) \vee (t[n-1] = i \wedge t[0] = j)
\tag{6.13}
\]

where n is the number of cities in the given TSP problem instance (in the
case of the ch130 problem, n = 130), and define the similarity sim(t1, t2) of
two tours t1 and t2 as

\[
\mathrm{sim}(t_1, t_2) =
\frac{\left| \{ [i, j] : \mathrm{isedge}([i, j], t_1) \wedge \mathrm{isedge}([i, j], t_2) \} \right|}{n}
\tag{6.14}
\]
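
The edge-based similarity of Formulas 6.13 and 6.14 can be sketched as
follows. Tours are assumed to be given as sequences of city indices in path
representation with at least three cities; the function names and the
symmetric flag (which treats the edges [i, j] and [j, i] as identical, as is
appropriate for symmetric TSP instances) are assumptions of this illustration.

```python
from typing import Sequence, Set


def tour_edges(tour: Sequence[int], symmetric: bool = True) -> Set:
    """Return the edges of a tour, including the closing edge back to the start."""
    n = len(tour)
    edges = set()
    for k in range(n):
        i, j = tour[k], tour[(k + 1) % n]
        edges.add(frozenset((i, j)) if symmetric else (i, j))
    return edges


def tour_similarity(t1: Sequence[int], t2: Sequence[int],
                    symmetric: bool = True) -> float:
    """Proportion of edges that are common to both tours."""
    if len(t1) != len(t2):
        raise ValueError("tours must visit the same number of cities")
    common = tour_edges(t1, symmetric) & tour_edges(t2, symmetric)
    return len(common) / len(t1)
```

For example, the tours [0, 1, 2, 3, 4] and [0, 1, 2, 4, 3] share the three
undirected edges {0, 1}, {1, 2}, and {3, 4}, which gives a similarity of
3/5.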

Figure 6.1 shows a graphical representation of the genetic diversity in the
GA’s population after 20 and 200 generations, shown on the left and right side
of the figure, respectively. For each pair of solutions (at indices i and j)
the similarity of these two solutions is represented by a square at position
(i, j); light squares represent small similarity values, while dark squares
indicate high similarities. Obviously the genetic diversity is high at the
beginning of the GA run but decreases very soon, as the similarity of
solutions is already very high after 200 generations.

The histograms shown in Figure 6.2 support this impression: While most
similarity values are rather low (below 0.25) in the beginning, i.e., after 20
iterations, most pairs of solutions show high similarity values between 0.5
and 0.8 after 200 generations.


¹ In the case of symmetric TSP instances, isedge([i, j], t) of course implies
isedge([j, i], t).

FIGURE 6.1: Similarity of solutions in the population of a standard GA after
20 and 200 iterations, shown in the left and the right charts, respectively.


Finally, Figure 6.3 shows the progress of all solutions’ average similarity
values for the first 2,000 and 10,000 generations, respectively. As we can
clearly see, most similarity values reach a very high level very soon: after a
bit more than 500 iterations the overall average (shown as a black line)
reaches 0.95; at the end of the GA run almost all pairs of solutions show a
similarity of more than 0.9, the average being approximately 0.96.


To demonstrate how multi-population specific similarities can be represented
graphically, we have set up a parallel GA with 4 populations, each evolving
exactly like the GA described for demonstrating single-population similarity.
Thus, the algorithm contains 4 populations, each storing 100 solutions for the
ch130 problem and evolving over 10,000 generations by order crossover and 5%
mutation.


Figure 6.4 graphically represents the multi-population specific similarity
for each solution in the 4 given populations after 5,000 generations: In row i
of bar j we give the maximum similarities of the ith solution in population
j compared to all solutions of all other populations using Formula 6.9; the
maximum similarities with other populations are given column-wise. In each
bar k we intentionally omit column k, as the maximum similarity of a solution
with its own population does not make sense in the context of the analysis of
multi-population genetic diversity. For example, in column 1 of row 20 in bar
1 we represent the maximum similarity of solution 20 in population 1 with all
solutions in population 2, and in column 2 of row 10 in bar 2 we show the
maximum similarity of solution 10 in population 2 with all solutions in
population 3. Again, higher similarity values are represented by darker
regions.

FIGURE 6.2: Histograms of the similarities of solutions in the population of
a standard GA after 20 and 200 iterations, shown in the left and the right
charts, respectively.


As we see in Figure 6.4, all solutions have rather high similarity with at
least one solution of each of the other populations. This tendency is also
shown in Figure 6.5, in which we give the average multi-population similarity
values for each solution calculated using Formula 6.11; the black line stands
for the average of these values, which is equal to the overall average value
calculated using Formula 6.12. As we see, the populations become more and more
similar to each other as the parallel GA is executed.

FIGURE 6.3: Average similarities of solutions in the population of a standard
GA for the first 2,000 and 10,000 iterations, shown in the upper and lower
charts, respectively.

FIGURE 6.4: Multi-population specific similarities of the solutions of a
parallel GA’s populations after 5,000 generations.

FIGURE 6.5: Progress of the average multi-population specific similarity
values of a parallel GA’s solutions, shown for 10,000 generations.


Chapter 7 


Characteristics of Offspring 
Selection and the RAPGA 


7.1 Introduction 


In this chapter we will try to look inside the internal functioning of several 
GA variants already discussed in previous chapters. For this purpose we 
use the information about globally optimal solutions which is only available 
for well studied benchmark problems of moderate dimension. Of course, the 
applied optimization strategies (i.e., in our case variants of GAs) are not 
allowed to use any information about the global optimum; we just use this 
information for analysis purposes in order to obtain a better understanding 
of the internal functioning and the dynamics of the most relevant algorithmic 
concepts discussed so far. 

A basic requirement for this (to a certain extent idealized) kind of analysis
is the existence of a unique globally optimal solution which has to be known.
Concretely, we aim to observe the distribution of the alleles of the globally
optimal solution over the generations in order to observe the ability of the
individual algorithmic variants to preserve and possibly regain essential
genetic material during the run of the algorithm.

The main aim of this book in general and especially of this chapter is not
to give a comprehensive analysis of many different problem instances, but
rather to highlight the main characteristics of the individual algorithm
variants. For the kind of analysis given in this chapter we have chosen the
traveling salesman problem (TSP), mainly because it is a well known and well
analyzed combinatorial optimization problem for which a lot of benchmark
problem instances are available. We here concentrate on the ch130 TSP instance
taken from the TSPLIB [Rei91], for which the unique globally optimal tour
is known; the characteristics of the global optimum of this 130 city TSP
instance are exactly the 130 edges of the optimal tour, which denote the
essential genetic information as stated in Chapter 3. In contrast to the
analyses described in Chapter 10, we here rather show results of single
characteristic test runs in order to identify the essentially important
algorithmic features; for statistically more significant tests the reader is
referred to Chapter 10.


In a broader interpretation of the building block theory, these alleles should
on the one hand be available in the initial population of a GA run, and on the
other hand be maintained during the run of the algorithm. If essential genetic
information is lost during the run, then mutation is supposed to help regain
it so that the algorithm is eventually able to find the globally optimal
solution (or at least a solution which comes very close to the global
optimum). In order to observe the actual situation in the population, we
display each of the 130 essential edges as a bar indicating the saturation of
the corresponding allele in the population, so there are 130 bars in total.
The disappearance of a bar therefore indicates the loss of the corresponding
allele in the entire population, whereas a full bar indicates that the allele
occurs in each individual (which is the desired situation at the end of an
algorithm run). As a consequence, the relative height of a bar stands for the
actual penetration level of the corresponding allele in the individuals of the
population, and observing the dynamic behavior of the bars allows us to follow
the distribution of essential genetic information during the run.
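
The bar heights described above can be computed as sketched below. This is a
hedged illustration: tours are assumed to be sequences of city indices with at
least three cities, edges are treated as undirected (symmetric TSP), and the
function names are hypothetical.

```python
from typing import List, Sequence


def undirected_edges(tour: Sequence[int]) -> List[frozenset]:
    """Edges of a tour in visiting order, including the closing edge."""
    n = len(tour)
    return [frozenset((tour[k], tour[(k + 1) % n])) for k in range(n)]


def allele_saturation(population: Sequence[Sequence[int]],
                      optimal_tour: Sequence[int]) -> List[float]:
    """For each edge of the optimal tour, the fraction of individuals containing it.

    0.0 means that the allele is lost in the entire population (the bar has
    disappeared); 1.0 means that it is fixed (a full bar).
    """
    edge_sets = [set(undirected_edges(tour)) for tour in population]
    return [sum(edge in edges for edges in edge_sets) / len(population)
            for edge in undirected_edges(optimal_tour)]
```

Counting the entries equal to 0.0 and 1.0 gives the number of lost and fixed
essential alleles in a snapshot, respectively.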

Of course one has to keep in mind that this special kind of analysis can
only be performed when the unique globally optimal solution is available;
usually, this information is not available in real world applications, where
we can only observe genetic diversity, as will be done in Chapters 10 and 11.
Nevertheless, these somewhat idealized conditions allow very deep insight into
the dynamic behavior of certain algorithm variants, which can also be extended
to other, more practically relevant problem situations.

In the following, the distribution of essential genetic information and its 
impact on achievable solution quality will be discussed for the standard GA, 
a GA variant including offspring selection as well as for the relevant alleles 
preserving GA (RAPGA) as introduced in Chapter 4. 


7.2 Building Block Analysis for Standard GAs 


For observing the distribution of essential alleles in a standard GA we have 
used the following test strategy: First, our aim was to observe the solution 
quality achievable with parameter settings that are quite typical for such kinds 
of GA applications (as given in Table 7.1) using the well known operators 
for the path representation, namely OX, ERX, and MPX; each algorithmic 
variant has been analyzed applying no mutation as well as mutation rates of 
5% and 10%. 

The following Figures 7.1, 7.2, and 7.3 show the fitness curves (showing 
best and average solution qualities of the GA’s population as well as the best 
known quality) for a standard GA using OX (see Figure 7.1), ERX (Figure 
7.2), and MPX (Figure 7.3), respectively; the parameter settings used for 
these experiments are given in Table 7.1.

Table 7.1: Parameters for test runs using a conventional GA.

Parameters for the conventional GA tests
(Results are graphically presented in Figures 7.1, 7.2, and 7.3)

Generations          20,000
Population Size      100
Elitism Solutions    1
Mutation Rate        0.00 or 0.05 or 0.1
Selection Operator   Roulette
Crossover Operator   OX (Fig. 7.1), ERX (Fig. 7.2) or MPX (Fig. 7.3)
Mutation Operator    Simple Inversion

(Chart “Quality Progress”; curves: average quality values, lowest quality
values, best known quality.)

FIGURE 7.1: Quality progress for a standard GA with OX crossover for
mutation rates of 0%, 5%, and 10%.


For the OX crossover, which achieved the best results with the standard
parameter settings, the results are shown in Figure 7.1; mutation rates of 5%
and 10% lead to quite good results (about 5% to 10% worse than the global
optimum), whereas disabling mutation leads to a rapid loss of genetic
diversity so that the solution quality stagnates at a very poor level.

The use of the more strongly edge preserving crossover operators ERX and MPX
(for which results are shown in Figures 7.2 and 7.3) shows different behavior:
with the same parameter settings as used for OX, the results are rather poor
independent of the mutation rate. The reason for this is that these operators
require more selection pressure (for example, tournament selection with
tournament size 3); when applying higher selection pressure it is possible to
achieve comparably good results with ERX and MPX as well. Still, even with
parameter settings that give good results with appropriate mutation rates, the
standard GA fails dramatically when mutation is disabled.

When applying selection pressures which are sufficiently high to promote 
the genetic search process beneficially, mutation is absolutely necessary to 
achieve high quality results using a standard GA. Only if no alleles are fixed 
and therefore no real optimization process takes place, disabling mutation 
would not cause stagnation of evolutionary search. 

Summarizing these aspects, we can state for the SGA applied to the TSP
that several well suited crossover operators² require totally different
combinations of parameter settings in order to make the SGA produce good
results. Considering the results achieved with the parameter settings stated
in Table 7.1, the use of OX yields good results (around 10% worse than the
global optimum), whereas the use of ERX and MPX leads to unacceptable results
(more than 100% worse than the global optimum). Conversely, tuning the
remaining parameters (population size, selection operator) for ERX or MPX
would cause poor solution quality for OX.

Thus, an appropriate adjustment of selection pressures is of critical impor- 
tance; as we will show in the following, self-adaptive steering of the selection 
pressure is able to make the algorithm more robust as selection pressure is 
adjusted automatically according to the actual requirements. 

Figure 7.4 shows the distribution of the 130 essential alleles of the unique
globally optimal solution over time for the overall best parameter
constellation found in this section, i.e., the use of OX crossover with 5%
mutation rate. In order to make the snapshots of the essential allele
distribution within the SGA’s population comparable to those captured when
applying a GA with offspring selection or the RAPGA, the timestamps are not
given in iterations but rather in the number of evaluations (which in the case
of the SGA is equal to the population size times the number of generations
executed).


¹ In order to keep this chapter compact and on an explanatory level, detailed
parameter settings and the corresponding analysis are not given here;
interested readers are kindly invited to reproduce these results using
HeuristicLab 1.1.

² OX, ERX, and MPX are all edge preserving operators and therefore basically
suited for the TSP.

FIGURE 7.2: Quality progress for a standard GA with ERX crossover for
mutation rates of 0%, 5%, and 10%.

Until about 10,000 evaluations, i.e., generation 100, we can observe quite
typical behavior, namely the rise of certain bars (representing the existence
of edges of the global optimum). However, between the 10,000th and 20,000th
evaluation some of the essential alleles (about 15 in our test run) become
fixed, whereas the rest (about 130 - 15 = 115 in our test run) disappear from
the entire population. As we can see in Figure 7.4, without mutation the
genetic search process would already be over at that moment due to the
fixation of all alleles; from then on, mutation is the driving force behind
the search process of the SGA.
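
As a quick worked check of the conversion between generations and evaluations
(assuming the population size of 100 and the 20,000 generations given in Table
7.1):

```latex
\[
\text{evaluations} = |POP| \cdot \text{generations}:
\qquad 100 \cdot 100 = 10{,}000,
\qquad 100 \cdot 200 = 20{,}000,
\qquad 100 \cdot 20{,}000 = 2{,}000{,}000
\]
```

Hence the 10,000th evaluation corresponds to generation 100, the 20,000th
evaluation to generation 200, and the complete SGA run to 2,000,000
evaluations.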

The effects of mutation in this context are basically as follows: Sometimes
high quality alleles are (by chance) injected into the population, and if
those are beneficial (not even necessarily in the mutated individual), then a
suitable crossover operator is able to spread the newly introduced essential
allele information over the population and achieve a status of fixation quite
rapidly. Thus, most of the essential alleles can be reintroduced and fixed
approximately between the 20,000th and 2,000,000th evaluation.

FIGURE 7.3: Quality progress for a standard GA with MPX crossover for
mutation rates of 0%, 5%, and 10%.


However, even if this procedure is able to fulfill the function of
optimization reasonably well when applying adjusted parameters, it has not
much in common with the desired functioning of a genetic algorithm as stated
in the schema theorem and the corresponding building block hypothesis.
According to this theory, we expect a GA to systematically collect the
essential pieces of genetic information which are initially spread over the
chromosomes of the initial population, as reported for the canonical GA.³ As
we will point out in the next sections, GAs with offspring selection as well
as the RAPGA are able to considerably support a GA to function in exactly that
way, even under conditions that are not as idealized as those required in the
context of the canonical GA.


³ This statement is in fact restricted to binary encoding, single point
crossover, bit-flip mutation, proportional selection, and generational
replacement; see Chapter 1 for further explanations.

FIGURE 7.4: Distribution of the alleles of the global optimal solution over
the run of a standard GA using OX crossover and a mutation rate of 5%
(remaining parameters are set according to Table 7.1).


7.3 Building Block Analysis for GAs Using Offspring 
Selection 


The aim of this section is to highlight some characteristics of the effects of 
offspring selection. In order to do so, we have chosen very strict parameter 
settings (no parental selection, strict offspring selection with 100% success 
ratio) which are given in Table 7.2. As the termination criterion of a GA 
with offspring selection is self-triggered, the effort of these test runs is not 
constant; however, the parameters are adjusted in a way that the total effort 
is comparable to the effort of the test runs for the SGA building block analyses 
discussed in Section 7.2.

Table 7.2: Parameters for test runs using a GA with offspring selection.

Parameter settings for the offspring selection GA runs
(Results are graphically presented in Figures 7.5, 7.6, 7.7, and 7.8)

Population Size              500
Elitism Solutions            1
Mutation Rate                0.00 or 0.05
Selection Operator           Random
Crossover Operator           OX, MPX, ERX, or combinations
Mutation Operator            Simple Inversion
Success Ratio                1.0
Comparison Factor Bounds     1.0
Maximum Selection Pressure   300


Similarly as for the SGA, we here take a look at the performance of some 
basically suited (edge preserving) operators. The results shown in Figures 7.5, 
7.6, and 7.7 highlight the benefits of self-adaptive selection pressure steering 
introduced by offspring selection: Independent of the other parameter set- 
tings, the use of all considered crossover operators yields results near the 
global optimum. 

As documented in the previous section, we have observed that the standard
GA heavily relies on mutation when the selection pressure is adjusted to
a level that allows the SGA to search in a goal oriented way. Therefore,
we are now especially interested in how offspring selection can handle the
situation when mutation is disabled. Figure 7.8 shows the quality curves
for the use of the ERX crossover operator (which achieved the best results
with 5% mutation) without mutation and with the same settings for the
remaining parameters. Remarkably, the result is practically as good as with
mutation, which means that offspring selection does not rely on the diversity
regaining aspect of mutation. Furthermore, this also means that offspring
selection is able to keep the essential genetic information, which in this
concrete example is given by the alleles of the globally optimal solution.
When using offspring selection (in contrast to the SGA), the algorithm is not
only able to keep the essential genetic information, but also slowly merges
the essential building blocks step by step; this complies with the core
statements of the building block theory and is not restricted to binary
encoding or the use of certain operators.

This behavior of offspring selection, which is very important for practical
applications, comes along with the property from which the method derives
its name: Due to offspring selection, only those children take part in the
ongoing evolutionary process which were successful offspring of their own
parents. Thus, one implicit assumption of the schema theory and the building
block hypothesis holds, which would not be valid for a lot of practical
applications: We enhance the evolutionary process in such a way that two
parents (with above average qualities) are able to produce offspring of
comparable or even better fitness, and that exactly these children take part
in the ongoing evolutionary process.

FIGURE 7.5: Quality progress for a GA with offspring selection, OX, and a
mutation rate of 5%.
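
One generation of strict offspring selection with the settings of Table 7.2
(random parent selection, success ratio 1.0, comparison factor 1.0) can be
sketched as follows. This is a simplified illustration and not the
HeuristicLab implementation: fitness is assumed to be minimized (as for tour
lengths), the selection pressure is assumed to be the number of evaluated
children divided by the population size, elitism is omitted, and the
callables fitness, crossover, and mutate are hypothetical placeholders
supplied by the caller.

```python
import random


def offspring_selection_generation(population, fitness, crossover, mutate,
                                   mutation_rate=0.05,
                                   max_selection_pressure=300.0,
                                   rng=random):
    """Return (next population, selection pressure, terminated).

    A child is accepted only if it is strictly better than both of its
    parents. The generation ends when the next population is full or when
    the maximum selection pressure is reached, in which case terminated is
    True and the next population may be incomplete.
    """
    pop_size = len(population)
    next_population = []
    evaluated = 0
    while len(next_population) < pop_size:
        if evaluated / pop_size >= max_selection_pressure:
            return next_population, evaluated / pop_size, True
        # Random parent selection: no fitness-based parental selection at all.
        parent1, parent2 = rng.choice(population), rng.choice(population)
        child = crossover(parent1, parent2, rng)
        if rng.random() < mutation_rate:
            child = mutate(child, rng)
        evaluated += 1
        # Strict offspring selection for a minimization problem.
        if fitness(child) < min(fitness(parent1), fitness(parent2)):
            next_population.append(child)
    return next_population, evaluated / pop_size, False
```

The selection pressure returned here grows automatically when successful
children become rare, which is the self-adaptive behavior referred to in the
text; reaching the maximum selection pressure serves as the self-triggered
termination criterion.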

In previous publications we have even gone one step further: We have
been able to show in [AW04b] and [Aff05] that even with crossover operators
basically considered unsuitable for the TSP (such as CX or PMX, which inherit
position information rather than edge information) it becomes possible to
achieve high quality results in combination with offspring selection. The
reason is that it is sufficient for a crossover operator to produce good
recombinations from time to time (as only these are considered for the
future gene pool); the price to be paid is that a higher average selection
pressure has to be applied if the crossover operator is less likely to
produce successful offspring.

For the path representation of the TSP, the characteristics of the crossover
operators mentioned before are well analyzed. However, when it comes to
practical applications, it is often not known a priori which of the possible
crossover concepts will perform well. In this context a further interesting
aspect of offspring selection becomes obvious: As only the successful
crossover results are considered for the ongoing evolutionary process, we can
simply apply several different crossover operators and select one of those at
random for each crossover.

(Chart “Quality Progress”: quality values over generations; curves: average
quality values, lowest quality values, best known quality.)

FIGURE 7.6: Quality progress for a GA with offspring selection, MPX, and
a mutation rate of 5%.
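
Selecting one of several crossover operators at random for each crossover, and
keeping track of how often each operator produces successful offspring, could
look as follows. The function names and the dictionary of operators are
assumptions of this sketch; any callables taking two parents can be plugged
in.

```python
import random
from collections import Counter


def crossover_with_random_operator(parent1, parent2, operators, rng=random):
    """Pick one operator uniformly at random and apply it.

    operators: mapping operator name -> callable(parent1, parent2) -> child
    Returns (name of the operator used, child).
    """
    name = rng.choice(sorted(operators))
    return name, operators[name](parent1, parent2)


def record(attempts: Counter, successes: Counter, name, was_successful):
    """Update the per-operator statistics after offspring selection."""
    attempts[name] += 1
    if was_successful:
        successes[name] += 1


def success_ratios(attempts, successes):
    """Ratio of successful offspring for each operator that has been applied."""
    return {name: successes.get(name, 0) / attempts[name]
            for name in attempts if attempts[name] > 0}
```

Evaluating success_ratios once per generation yields the kind of per-operator
success curves that allow comparing the operators over the run of the
algorithm.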

As a proof of concept for applying more than one crossover operator at the
same time, we have repeated the previous test runs with OX, MPX, and ERX with
the only difference that all three crossover operators have been used in these
tests. Figure 7.9 shows the quality curves (best and average results) for this
test run; the results are in the region of the globally optimal solution
and therefore at least as good as in the previous test runs. A further
question that arises when using multiple operators at once concerns their
performance over time: Is the performance of each individual operator
relatively constant over the run of the algorithm?

FIGURE 7.7: Quality progress for a GA with offspring selection, ERX, and
a mutation rate of 5%.


In order to answer this question, Figure 7.10 shows the ratio of successful
offspring for each crossover operator used (in the sense of strict offspring
selection, which requires that successful children have to be better than both
parents). ERX performs very well at the beginning (approximately until
generation 45) as well as in the last phase of the run (from about generation
75). In between (approximately from generation 45 to generation 75), when the
contribution of ERX is rather low, MPX shows significantly better performance.
The performance of OX in terms of its ability to generate successful offspring
is rather mediocre during the whole run, showing very little success in the
last phase. Analyzing the reasons for the behavior of the individual operators
over time would be an interesting field of research; in any case, it is
already very interesting to observe that the performance characteristics of
the operators change over time to such an extent.
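
The random choice among several crossover operators and the per-operator
success ratio can be sketched in a few lines of Python. This is only a
simplified illustration and not the HeuristicLab implementation: the function
name os_generation, the operators dictionary, and the max_trials cap are
assumptions made here, parents are drawn uniformly at random, and a child is
accepted only if its cost is lower than that of both parents (strict offspring
selection for a minimization problem).

```python
import random
from collections import Counter


def os_generation(population, cost, operators, rng, max_trials=10_000):
    """One generation of a simplified strict offspring selection (minimization).

    operators maps a (hypothetical) operator name to a function
    crossover(parent1, parent2, rng) -> child.
    Returns the accepted children and, per operator, the ratio of
    successful children to the population size.
    """
    successes = Counter()
    children = []
    trials = 0
    names = sorted(operators)
    while len(children) < len(population) and trials < max_trials:
        trials += 1
        p1, p2 = rng.sample(population, 2)  # simplified parent selection
        name = rng.choice(names)            # pick one operator at random
        child = operators[name](p1, p2, rng)
        # Strict offspring selection: the child must beat both parents.
        if cost(child) < min(cost(p1), cost(p2)):
            children.append(child)
            successes[name] += 1
    ratios = {name: successes[name] / len(population) for name in names}
    return children, ratios


# Toy usage: individuals are plain numbers and the cost is the number itself.
toy_operators = {
    "always_better": lambda a, b, rng: min(a, b) - 1,
    "never_better": lambda a, b, rng: max(a, b),
}
toy_children, toy_ratios = os_generation(
    list(range(10, 20)),
    cost=lambda x: x,
    operators=toy_operators,
    rng=random.Random(0),
)
```

In the toy usage only the first operator can ever produce a successful child,
so it accounts for the whole success ratio; with real TSP operators the ratios
would be recorded once per generation to obtain curves like those discussed
above.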

FIGURE 7.8: Quality progress for a GA with offspring selection, ERX, and
no mutation.


For a more detailed observation of the essential alleles during the runs
of the GA using offspring selection we show the allele distribution for the
ERX crossover, which achieved slightly better results than the other crossover
operators, in Figure 7.11. However, the characteristics of the distribution of
essential alleles are quite similar for the other crossover operators when
using offspring selection. As a major difference in comparison to the essential
allele distributions during a standard GA, we can observe that the diffusion
of the essential alleles is established in a rather slow and smooth manner.
The essential alleles are neither lost nor fixed in the earlier stages of the
algorithm, so the bars indicating the occurrence of the individual essential
alleles (edges of the optimal TSP path) in the entire population grow steadily
until almost all of them are fixed by the end of the run. This behavior is not
only in accordance with the building block hypothesis, but also implies that
the algorithm performance no longer relies on mutation to the extent observed
in the corresponding SGA analyses. In order to confirm this assumption we have
repeated the same test without mutation and indeed, as can be seen by
comparing Figures 7.11 and 7.12, the saturation behavior of the essential
building blocks is basically the same, no matter whether mutation is used or
not. This is a remarkable observation as it shows that offspring selection
enables a GA to collect the essential building blocks represented in the
initial population and compile high quality solutions very robustly in terms
of parameters and operators like mutation, selection pressure, crossover
operators, etc. This property is especially important when
exploring new fields of application where suitable parameters and operators
are usually not known a priori.


FIGURE 7.9: Quality progress for a GA with offspring selection using a
combination of OX, ERX, and MPX, and a mutation rate of 5%.

FIGURE 7.10: Success progress of the different crossover operators OX, ERX,
and MPX, and a mutation rate of 5%. The plotted graphs represent the ratio
of successfully produced children to the population size over the generations.

FIGURE 7.11: Distribution of the alleles of the global optimal solution over
the run of an offspring selection GA using ERX crossover and a mutation rate
of 5% (remaining parameters are set according to Table 7.2).

FIGURE 7.12: Distribution of the alleles of the global optimal solution over
the run of an offspring selection GA using ERX crossover and no mutation
(remaining parameters are set according to Table 7.2).


7.4 Building Block Analysis for the Relevant Alleles
Preserving GA (RAPGA)


Similar to the previous section we aim to highlight some of the most charac- 
teristic features of the relevant alleles preserving GA as introduced in Section 
4.3. The parameter settings of the RAPGA as given in Table 7.3 are also 
described in Section 4.3. 


Table 7.3: Parameters for test runs using the relevant alleles preserving genetic
algorithm.

Parameters for the RAPGA tests
(Results presented in Fig. 7.13, Fig. 7.14, Fig. 7.15, Fig. 7.16, and Fig. 7.17)

Max. Generations            1,000
Initial Population Size     500
Mutation Rate               0.00 or 0.05
Elitism Rate                1
Male Selection              Roulette
Female Selection            Random
Crossover Operators         OX
                            ERX
                            MPX
                            combined (OX, ERX, and MPX)
Mutation Operator           Simple Inversion
Minimum Population Size     5
Maximum Population Size     700
Twin Exclusion              true
Check Structural Identity   true
Effort                      20,000
Comparison Factor Bounds    1 to 1
Attenuation                 0


The main characteristics of the RAPGA are quite similar to those of a GA using
offspring selection. The most important aspects of offspring selection
are implicitly included in the RAPGA. In addition, the RAPGA introduces
adaptive population size adjustment in order to support offspring selection
in exploiting the genetic information available in the current population to
the maximum, in terms of achieving new (in order to maintain diversity) and
even better (the offspring selection aspect) solution candidates for the next
generation. RAPGA is a rather young algorithmic idea which has been presented
in [AWW07]. Nevertheless, there is evidence that the RAPGA is as
generic and flexible as offspring selection has already proven to be for a wide
range of GA and GP applications.


The following experiments are set up quite similarly to the offspring selection
experiments of the previous section. Firstly, the considered operators OX,
MPX, and ERX as well as their combination (OX, MPX, ERX) are applied
to the ch130 benchmark TSP instance taken from the TSPLib. Then the most
successful operator or operator combination, respectively, is also considered
without mutation in order to show that the RAPGA, like offspring selection,
does not rely on mutation to the same extent as conventional GAs.


FIGURE 7.13: Quality progress for a relevant alleles preserving GA with OX
and a mutation rate of 5%.


Even the experiments using OX (see Figure 7.13) and MPX (Figure 7.14)
show good results (approximately 5% to 10% worse than the globally optimal
solution), which are even slightly better than the corresponding offspring
selection results. Even though only single test runs are shown in this chapter,
it has to be pointed out that the authors have taken care that characteristic
runs are shown. Besides, as can be seen in the more systematic experiments of
Chapter 10, the variance of the result qualities is quite small, especially
due to the increased robustness caused by offspring selection and RAPGA.


As stated for the OS analyses, the best results for the RAPGA could also
be achieved using ERX (as shown in Figure 7.15) as well
as using the combination of OX, ERX, and MPX (see Figures 7.16 and 7.17).


FIGURE 7.14: Quality progress for a relevant alleles preserving GA with
MPX and a mutation rate of 5%.

FIGURE 7.15: Quality progress for a relevant alleles preserving GA with ERX
and a mutation rate of 5%.


The results achieved using these operators are about 1% worse than the global
optimal solution, or even closer to it. In the case of the RAPGA the operator
combination turned out to be slightly better than ERX (in 18 of 20 test
runs). Therefore, this is the operator combination we have also considered for
a detailed building block analysis without mutation as well as with 5%
mutation.


FIGURE 7.16: Quality progress for a relevant alleles preserving GA using a
combination of OX, ERX, and MPX, and a mutation rate of 5%.

FIGURE 7.17: Quality progress for a relevant alleles preserving GA using a
combination of OX, ERX, and MPX, and mutation switched off.


Not surprisingly, the results of the RAPGA with the operator combination
consisting of OX, ERX, and MPX turned out to be quite similar to those
achieved using offspring selection and the ERX operator. Due to the eponymous
aspect of essential allele preservation, disabling mutation (see Figures
7.16 and 7.17) has almost no consequences for the achievable global solution
quality. Even without mutation the results are just 1-2% worse than the
global optimum. The distributions of essential alleles over the generations of
the RAPGA run (as shown in Figure 7.18 and Figure 7.19) also show behavior
quite similar to that already observed in the corresponding analyses of the
effects of offspring selection. Almost all essential alleles are represented in
the first populations and their diffusion grows slowly over the GA run, and
even without mutation the vast majority of essential alleles is fixed by the end
of the RAPGA runs.
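
The essential allele distributions discussed here can be computed directly
from a population once the optimal tour is known. The following is a minimal
sketch under stated assumptions: tours are given in path representation, the
TSP is symmetric (so edges are treated as undirected), and the function names
tour_edges and essential_allele_frequencies are chosen for illustration only.

```python
def tour_edges(tour):
    """Return the set of undirected edges of a tour in path representation."""
    n = len(tour)
    return {frozenset((tour[i], tour[(i + 1) % n])) for i in range(n)}


def essential_allele_frequencies(population, optimal_tour):
    """Fraction of individuals containing each edge of the optimal tour."""
    essential = tour_edges(optimal_tour)
    counts = {edge: 0 for edge in essential}
    for tour in population:
        # Only edges shared with the optimal tour are essential alleles.
        for edge in tour_edges(tour) & essential:
            counts[edge] += 1
    return {edge: count / len(population) for edge, count in counts.items()}


# Toy usage with four cities and a population of two tours.
demo_freq = essential_allele_frequencies(
    population=[[0, 1, 2, 3], [0, 2, 1, 3]],
    optimal_tour=[0, 1, 2, 3],
)
```

A frequency of 1.0 corresponds to a fixed essential allele and a frequency of
0.0 to a lost one; plotting these values per generation gives bar charts of
the kind shown in the allele distribution figures.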

Summarizing these results, we can state that quite similar convergence be- 
havior is observed for a GA with offspring selection and the RAPGA, which 
is characterized by efficient maintenance of essential genetic information. As 
shown in Section 7.2, this behavior (which we would intuitively expect from 
any GA) cannot be guaranteed in general for GA applications where it was 
mainly mutation which helped to find acceptable solution qualities.


FIGURE 7.18: Distribution of the alleles of the global optimal solution over
the run of a relevant alleles preserving GA using a combination of OX, ERX,
and MPX, and a mutation rate of 5% (remaining parameters are set according
to Table 7.3).

FIGURE 7.19: Distribution of the alleles of the global optimal solution over
the run of a relevant alleles preserving GA using a combination of OX, ERX,
and MPX without mutation (remaining parameters are set according to
Table 7.3).


Chapter 8 


Combinatorial Optimization: Route 
Planning 


There is a great number of combinatorial optimization problems to which
genetic algorithms have been applied so far. In the following we will
concentrate on two selected route planning problems with many attributes that
are representative of many combinatorial optimization problems, namely the
traveling salesman problem (TSP) and the vehicle routing problem (VRP).

The traveling salesman problem is certainly one of the classical as well as
most frequently analyzed representatives of combinatorial optimization
problems, with many solution methodologies and solution manipulation
operators. Compared to other combinatorial optimization problems, the
main difference is that very powerful problem-specific methods, such as
the Lin-Kernighan algorithm [LK73] and effective branch and bound
methods, are available that are able to achieve a global optimal solution
even for very high problem dimensions. These high-dimensional problem
instances with a known global optimal solution are very well suited as
benchmark problems for metaheuristics such as GAs.

The VRP as well as its derivatives, the capacitated VRP (CVRP) and the
capacitated VRP with time windows (CVRPTW), which will be introduced
in this chapter, are much closer to practical problem situations in transport
logistics, and solving them requires the handling of implicit and explicit
constraints. There are also no comparably powerful problem-specific methods
available, and metaheuristics like tabu search and genetic algorithms are
considered the most powerful problem solving methods for the VRP, which is a
different but no less interesting situation than that of the TSP.


8.1 The Traveling Salesman Problem 


The TSP is quite easy to state: Given a finite number of cities along with
the cost of travel between each pair of them, the goal is to find the cheapest
way of visiting all the cities exactly once and returning to the starting point.
Usually the travel costs are symmetric. A tour can simply be described by
the order in which the cities are visited; the data consist of integer weights
assigned to the edges of a finite complete graph and the objective is to find a
Hamiltonian cycle, i.e., a cycle passing through all the vertices, of minimum
total weight. In this context, Hamiltonian cycles are commonly called tours.

Already in the early 19th century the TSP appeared in literature [Voi31]. 
In the 1920s, the mathematician and economist Karl Menger [Men27] pub- 
lished it in Vienna; it reappeared in the 1930s in the mathematical circles 
of Princeton. In the 1940s, it was studied by statisticians (Mahalanobis, see 
[Mah40], e.g., and Jessen, see for instance [Jes42]) in connection with an agri- 
cultural application. The TSP is commonly considered the prototype of a 
hard problem in combinatorial optimization. 


8.1.1 Problem Statement and Solution Methodology 
8.1.1.1 Definition of the TSP 


In a formal description the TSP is defined as the search for the shortest
Hamiltonian cycle of a graph whose nodes represent cities. The objective
function $f$ represents the length of a route and therefore maps the set $S$ of
admissible routes into the real numbers $\mathbb{R}$ [PS82]:

$$f : S \to \mathbb{R}$$

The aim is to find the optimal tour $s^* \in S$ such that
$f(s^*) \le f(s_k)\ \forall s_k \in S$.
In order to state the objective function $f$ we have to introduce a distance
matrix $[d_{ij}]$, $d_{ij} \in \mathbb{R}^+$, whose entries represent the
distance from a city $i$ to a city $j$. In that kind of representation the
cities are considered as the nodes of the underlying graph. If there is no
edge between two nodes, the distance is set to infinity.

Using the notation given in [PS82], where $\pi_k(i)$ represents the city visited
next after city $i$ in a certain tour $s_k$, the objective function is defined as

$$f(s_k) = \sum_{i=1}^{n} d_{i\,\pi_k(i)} \tag{8.1}$$
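
As a small illustration of the objective function (8.1), the following Python
sketch evaluates the length of a tour from a distance matrix. The helper names
successor_map and tour_length, the 0-based city indices, and the example
matrix are assumptions made for this example; the successor map plays the role
of the successor function in (8.1).

```python
def successor_map(tour):
    """Map each city to the city visited next in the (cyclic) tour."""
    n = len(tour)
    return {tour[p]: tour[(p + 1) % n] for p in range(n)}


def tour_length(tour, d):
    """Objective (8.1): sum of d[i][successor(i)] over all cities i."""
    succ = successor_map(tour)
    return sum(d[i][succ[i]] for i in tour)


# Asymmetric 3-city example: d[i][j] is the distance from city i to city j.
d_example = [
    [0, 1, 5],
    [2, 0, 3],
    [4, 6, 0],
]
length_forward = tour_length([0, 1, 2], d_example)   # 1 + 3 + 4
length_backward = tour_length([0, 2, 1], d_example)  # 5 + 6 + 2
```

Because the example matrix is asymmetric, traversing the same cycle in the
opposite direction yields a different length.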


By this definition the general asymmetric TSP is specified. By means of
certain constraints on the distance matrix it is possible to define several
variants of the TSP. A detailed overview of the variants of the TSP is given
in [LLRKS85]. The most important specializations consider symmetry, the
triangle inequality, and Euclidean distances:


Symmetry 


A TSP is defined to be symmetric if and only if its distance matrix is
symmetric, i.e., if

$$d_{ij} = d_{ji} \quad \forall i, j \in \{1, \dots, n\} \tag{8.2}$$

If this set of equalities is not satisfied for at least one pair $(i, j)$, for
example if "one-way streets" occur, we denote the problem as an asymmetric TSP.


The Triangle Inequality


Symmetric as well as asymmetric TSPs can, but do not necessarily have to,
satisfy the triangle inequality:

$$d_{ij} \le d_{ik} + d_{kj} \quad \forall i, j, k \in \{1, \dots, n\} \tag{8.3}$$

i.e., the direct route between two cities must be shorter than or as long
as any route including another node in between.
A reasonable violation of the triangle inequality is possible especially when the
entries in the distance matrix are interpreted as costs rather than as distances.
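
Conditions (8.2) and (8.3) are easy to test for a given distance matrix. The
sketch below is illustrative only; the function names and example matrices are
assumptions, indices are 0-based, and exact comparisons are used, so matrices
with floating-point entries may need a small tolerance.

```python
from itertools import product


def is_symmetric(d):
    """Check condition (8.2): d[i][j] == d[j][i] for all i, j."""
    n = len(d)
    return all(d[i][j] == d[j][i] for i in range(n) for j in range(n))


def satisfies_triangle_inequality(d):
    """Check condition (8.3): d[i][j] <= d[i][k] + d[k][j] for all i, j, k."""
    n = len(d)
    return all(
        d[i][j] <= d[i][k] + d[k][j]
        for i, j, k in product(range(n), repeat=3)
    )


metric_matrix = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
cost_matrix = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]  # direct link 0-2 is "too expensive"
one_way_matrix = [[0, 1], [2, 0]]                # asymmetric
```

The second matrix is symmetric but violates the triangle inequality, as can
happen when entries are costs rather than distances; the third is asymmetric.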


Euclidean Distances 


As a rather important subset of symmetric TSPs satisfying the triangle
inequality we consider the so-called Euclidean TSP. For the Euclidean TSP it
is mandatory to specify the coordinates of each node in the n-dimensional
Euclidean space. For the 2-dimensional case the entries $d_{ij}$ of the distance
matrix are consequently given by the Euclidean distance

$$d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} \tag{8.4}$$

where $x_i$ and $y_i$ denote the coordinates of a certain city $i$.
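
For the 2-dimensional case, the distance matrix of a Euclidean TSP can be
built from the city coordinates as follows. This is a minimal sketch; the
function name and the example coordinates are assumptions, and benchmark
libraries may additionally round the distances, which is not done here.

```python
import math


def euclidean_distance_matrix(coords):
    """Build the matrix of Euclidean distances (8.4) from (x, y) coordinates."""
    return [
        [math.hypot(xi - xj, yi - yj) for (xj, yj) in coords]
        for (xi, yi) in coords
    ]


# Three cities forming a 3-4-5 right triangle.
d_triangle = euclidean_distance_matrix([(0, 0), (3, 4), (3, 0)])
```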

In contrast to most problems that occur in practice, many TSP benchmark
tests use Euclidean TSP instances. However, GA-based metaheuristics do not
take advantage of the Euclidean structure and can therefore also be used for
non-Euclidean TSPs.


8.1.1.2 Versions of the TSP 


Motivated by certain situations appearing in operational practice, some
more variants of the TSP have emerged. Notable variants that
will not be considered further within the scope of this book are the
following:


Traveling Salesman Subtour Problems (TSSP) 


In contrast to the TSP not all cities have to be visited in the context of 
the TSSP; only those cities are visited that are worth being visited which 
implies the necessity of some kind of profit function in order to decide if the 
profit is higher than the travel expenses. Vice versa, depending on the actual 
implementation, this can also be realized by the introduction of penalties for 
not visiting a certain node (city). 


Postman Problems 


For postman problems (e.g., [Dom90]) it is not a given set of nodes (cities)
that has to be visited; rather, given sets of edges (which can be interpreted
as streets of houses) have to be passed at least once with the goal of
minimizing the total route length. Therefore, the aim is a suitable selection
of the edges to be passed in a certain order for obtaining minimal cost.


Time Dependent TSP 


In time dependent TSPs the cost of visiting a city $j$ starting from a city
$i$ does not only depend on $d_{ij}$ but also on the position in the total
route or, even more generally, on the point of time at which a certain city is
visited [BMR93].


Traveling Salesman Problem with Time Windows (TSPTW) 


Like the TSP, the TSPTW is stated as finding an optimal tour for a set of
cities where each city has to be visited exactly once. In addition to the
requirements of the TSP, the tour must start and end at a unique depot within
a certain time window and each city must be visited within its own time
window. The cost is usually defined by the total travel distance and/or by the
total schedule time (which is defined as the sum of travel time, waiting time,
and service time) [Sav85].


8.1.1.3 Review of Optimal Algorithms 
Total Enumeration 


In principle, total enumeration is applicable to all integer optimization
problems with a finite solution space: All points of the solution space $S$ are
evaluated by means of an objective function, and the best solution found so
far is stored. As the TSP has a worst case complexity of $O(n!)$, total
enumeration is only applicable to very small problem instances. For example,
even for a rather small and simple 30-city symmetric TSP one would have to
consider $\frac{(n-1)!}{2} = \frac{29!}{2}$ possible solutions, which would
require a computational time of about $1.4 \times 10^{12}$ years
assuming the use of a very powerful computer which can evaluate 100,000
million routes per second.
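
Both the enumeration itself and the estimate above can be reproduced with a
short Python sketch. The function names are assumptions made for this
illustration, a year is taken as 365.25 days, and the enumeration fixes city 0
as the starting point but, for simplicity, does not exploit symmetry, so it
visits (n-1)! rather than (n-1)!/2 tours.

```python
import math
from itertools import permutations


def brute_force_tsp(d):
    """Total enumeration: evaluate every tour starting at city 0."""
    n = len(d)
    best_tour, best_len = None, math.inf
    for perm in permutations(range(1, n)):
        tour = (0,) + perm
        length = sum(d[tour[i]][tour[(i + 1) % n]] for i in range(n))
        if length < best_len:  # keep the best solution found so far
            best_tour, best_len = tour, length
    return best_tour, best_len


def enumeration_years(n, routes_per_second=1e11):
    """Years needed to evaluate (n-1)!/2 tours of a symmetric n-city TSP."""
    tours = math.factorial(n - 1) // 2
    seconds = tours / routes_per_second
    return seconds / (365.25 * 24 * 3600)


# Four cities on a unit square, with the diagonals set to 2 for simplicity.
d_square = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]
square_tour, square_len = brute_force_tsp(d_square)
years_for_30_cities = enumeration_years(30)
```

For 30 cities the estimate evaluates to roughly 1.4 x 10^12 years, in line
with the figure stated in the text.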


Integer Programming 


In order to apply integer programming to the TSP it is mandatory to
introduce a further $n \times n$ matrix $X = [x_{ij}]$ with
$x_{ij} \in \{0, 1\}$ where $x_{ij}$ indicates whether or not there is a
connection from city $i$ to city $j$. Thus, the optimization problem can be
stated in the following way:

Find

$$\min \sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij} x_{ij} \tag{8.5}$$

such that

$$\sum_{j=1}^{n} x_{ij} = 1 \quad \forall i \in \{1, \dots, n\} \tag{8.6}$$

$$\sum_{i=1}^{n} x_{ij} = 1 \quad \forall j \in \{1, \dots, n\} \tag{8.7}$$

$$x_{ij} \ge 0 \quad \forall i, j \in \{1, \dots, n\} \tag{8.8}$$

These constraints ensure that each city has exactly one successor and
exactly one predecessor. The representation given above is
also called an assignment problem and was first applied to the TSP by
Dantzig [DR59]. However, the assignment problem alone does not assure a
single Hamiltonian cycle, i.e., it is also possible that two or more subcycles
exist, which does not constitute a valid TSP tour.
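
The subcycle issue can be made concrete with a small Python sketch that checks
the row and column constraints (8.6) and (8.7) of a 0/1 matrix and then
decomposes the assignment into its cycles. The function names and the two
example matrices are assumptions for illustration; subcycles presumes that the
assignment constraints already hold.

```python
def satisfies_assignment_constraints(x):
    """Check (8.6) and (8.7): every row and every column sums to one."""
    n = len(x)
    rows_ok = all(sum(x[i][j] for j in range(n)) == 1 for i in range(n))
    cols_ok = all(sum(x[i][j] for i in range(n)) == 1 for j in range(n))
    return rows_ok and cols_ok


def subcycles(x):
    """Decompose a feasible 0/1 assignment matrix into its cycles."""
    n = len(x)
    succ = [row.index(1) for row in x]  # successor of each city
    seen, cycles = set(), []
    for start in range(n):
        if start in seen:
            continue
        cycle, city = [], start
        while city not in seen:
            seen.add(city)
            cycle.append(city)
            city = succ[city]
        cycles.append(cycle)
    return cycles


# Feasible assignment consisting of two subcycles: 0 -> 1 -> 0 and 2 -> 3 -> 2.
x_split = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
# Feasible assignment forming one Hamiltonian cycle: 0 -> 1 -> 2 -> 3 -> 0.
x_tour = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0]]
```

Both matrices satisfy the assignment constraints, but only the second one
describes a valid tour.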

Hence, for the TSP it is necessary to state further conditions in order to
define an assignment problem without subcycles, and therefore the integer
property of the assignment problem does not hold any more. Similar to linear
programming, in integer programming the admissible solution space can be
restricted by the given constraints, but the corners of the emerging
polyhedron will not represent valid solutions in general. In fact, only a
rather small number of points inside the polyhedron will represent valid
solutions of the integer program.

As described in [Gom63] Gomory tried to overcome this drawback by in- 
troducing the cutting-plane method that introduces virtual constraints, the 
so-called cutting-planes, in order to ensure that all corners of the convex poly- 
hedron are integer solutions. The crux in the construction of suitable cutting 
planes is that it requires a lot of very problem-specific knowledge. 

Grötschel's dissertation [Grö77] was one of the first contributions that
considered a special TSP instance in detail, and many articles about the
solution of specific TSP benchmark problems with problem-specific cutting
planes have since been published (as for example in [CP80]).

Unfortunately, the methods for constructing suitable cutting-planes are far 
away from working in an automated way and require a well versed user. There- 
fore, the main area of application of integer programming for the TSP is the 
exact solution of some large benchmark problems in order to get reference 
problems for testing certain heuristics. 


8.1.2 Review of Approximation Algorithms and Heuristics 


During the last four decades a variety of heuristics for the TSP has been
published; Lawler et al. have given a comparison of the most established ones
in [LLRKS85]. Operations research basically distinguishes between methods
that are able to construct new solution routes, called route building heuristics
or construction heuristics, and methods that assume a certain (valid) route
in order to improve it, which are understood as route improving heuristics.


Nearest Neighbor Heuristics 


The nearest neighbor algorithm [LLRKS85] is a typical representative of
a route building heuristic. It simply considers a city as its starting point
and repeatedly takes the nearest not yet visited city in order to build up the
Hamiltonian cycle. At the beginning this strategy works out quite well,
whereas adverse stretches have to be inserted when only a few cities are left.
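
A nearest neighbor construction can be sketched as follows. The function name,
the 0-based indices, and the tie-breaking rule (the city with the smaller
index wins) are assumptions made for this illustration.

```python
def nearest_neighbor_tour(d, start=0):
    """Build a tour by always moving to the nearest unvisited city."""
    n = len(d)
    tour = [start]
    unvisited = set(range(n)) - {start}
    while unvisited:
        current = tour[-1]
        # Ties are broken by the smaller city index (an arbitrary choice).
        nxt = min(unvisited, key=lambda j: (d[current][j], j))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour  # the tour closes implicitly from tour[-1] back to start


# Four cities on a line at positions 0, 1, 3, and 6.
positions = [0, 1, 3, 6]
d_line = [[abs(p - q) for q in positions] for p in positions]
tour_from_0 = nearest_neighbor_tour(d_line, start=0)
tour_from_1 = nearest_neighbor_tour(d_line, start=1)
```

Starting from the second city, the heuristic first moves to the closest city
on the left and then has to traverse the whole line, a small instance of the
adverse stretches mentioned above.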


Figure 8.1 shows a typical result of nearest neighbor heuristics applied to a 
TSP instance that demonstrates its drawbacks. 


FIGURE 8.1: Exemplary nearest neighbor solution for a 51-city TSP instance
([CE69]).


Partitioning Heuristics


Applying partitioning heuristics to the TSP means splitting the total num- 
ber of cities into smaller sets according to their geographical position. The 
emerging subsets are treated and solved as independent TSPs and the solution 
of the original TSP is given by a combination of the partial solutions. 

The success rate of partitioning heuristics for TSPs very much depends on 
the size and the topological structure of the TSP. Partitioning heuristics do 
not perform well in the general case. Particularly suitable for partitioning 
heuristics are only higher dimensional Euclidean TSPs with rather uniformly 
distributed cities. Algorithms based on partitioning have been presented in 
[Kar77], [Kar79], and [Sig86]. 


Local Search 


Typical representatives of route improving heuristics are the so-called k- 
change methods that examine a k-tuple of edges of a given tour and test 
whether or not a replacement of the tour segments effects an improvement of 
the actual solution quality. 

A lot of established improvement methods are based upon local search
strategies, which has also led to the term "neighborhood search." The basic
idea is to search through the surroundings of a certain solution $s_i$ in
order to replace $s_i$ by a possibly detected "better" neighbor $s_j$.

The formal description of the neighborhood structure is given by
$N \subseteq S \times S$ with $S$ denoting the solution space. The choice of
$N$ is up to the user with the only restriction that the corresponding graph
has to be connected and undirected, i.e., the neighborhood structure should be
designed in a way that any point in the solution space is reachable and that
$s_i$ being a direct neighbor of $s_j$ implies $s_j$ being a direct neighbor
of $s_i$.

Mainly for reasons of implementation the following formal definition has
been established:

$$N \subseteq S \times S$$

with

$$N(s_i) := \{ s_j \in S \mid (s_i, s_j) \in N \}.$$


Choosing a neighborhood of larger size can cause problems concerning
computational time, whereas a rather small neighborhood increases the
probability of getting stuck in a local optimum [PS82]. The search for a
better solution in the neighborhood is performed successively until no better
solution can be detected. Such a point is commonly referred to as a local
minimum with respect to a certain neighborhood structure, and the neighborhood
structure is termed definite if and only if any local optimum with respect to
this neighborhood coincides with a global optimum $s^*$. Unfortunately, the
verification of a definite neighborhood itself is in most cases an NP-complete
problem [PS82].


The 2-Opt Method


The most popular local edge-recombination heuristic is the 2-change
replacement of two edges. In this context the neighborhood is defined in the
following way:


A tour $s_i$ is adjacent (neighboring) to a tour $s_j$ if and only if $s_j$
can be derived from $s_i$ by replacing two of $s_i$'s edges.


FIGURE 8.2: Example of a 2-change for a TSP instance with 7 cities. 


Numbering the cities in the order they are visited with $c_1, \dots, c_n$
yields the following representation of two adjacent routes:

$$(c_1 \dots c_i c_{i+1} \dots c_j c_{j+1} \dots c_n) \longrightarrow (c_1 \dots c_i c_j \dots c_{i+1} c_{j+1} \dots c_n)$$


Figure 8.2 illustrates one possible 2-change operation for a small TSP
instance. In this example (assuming that the left tour is transformed to the
right tour) the two edges 5-7 and 6-1 are removed and the edges 5-6 and
7-1 are inserted in order to reestablish a valid tour.

Any route $s_j$ can be derived from any other route $s_i$ by at most $(n - 2)$
2-change operations [AK89] and any solution $s_i$ has exactly
$\frac{n(n-1)}{2}$ neighboring (adjacent) solutions. For a symmetrical TSP (as
indicated in the example of Figure 8.2) the number of neighboring solutions
reduces to $\frac{n(n-2)}{2}$ [GS90].
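
A 2-change and a simple improvement loop built on it can be sketched in Python
as follows. This is a minimal illustration under stated assumptions: the TSP
is symmetric (a 2-change reverses the direction of a tour segment), indices
are 0-based, tour lengths are recomputed from scratch for clarity rather than
updated incrementally, and the function names are chosen for this example.

```python
import math


def tour_length(tour, d):
    """Length of the closed tour given in path representation."""
    n = len(tour)
    return sum(d[tour[i]][tour[(i + 1) % n]] for i in range(n))


def two_change(tour, i, j):
    """Replace edges (c_i, c_i+1) and (c_j, c_j+1) by reversing c_i+1..c_j."""
    return tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]


def two_opt(tour, d):
    """Apply improving 2-changes until no 2-change shortens the tour."""
    best = list(tour)
    n = len(best)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 1, n):
                candidate = two_change(best, i, j)
                if tour_length(candidate, d) < tour_length(best, d):
                    best = candidate
                    improved = True
    return best


# Unit square; the initial tour crosses itself along both diagonals.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
d_sq = [[math.hypot(ax - bx, ay - by) for (bx, by) in square] for (ax, ay) in square]
untangled = two_opt([0, 2, 1, 3], d_sq)
```

The loop terminates because every accepted 2-change strictly reduces the tour
length; the result cannot be improved by any further 2-change, but it is not
guaranteed to be globally optimal.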

As early as half a century ago Croes [Cro58] published a solution technique
for the TSP which is based upon the 2-change method: The algorithm has to
check whether an existing route $s_i$ can be improved by the 2-change operator
and performs the change where applicable. This process is repeated until
$f(s_j) \ge f(s_i)$ for all $s_j$ that can be generated by using 2-change; the
resulting route is called 2-optimal. Unfortunately, it is very unlikely that a
2-optimal tour is globally optimal.


The 3-Opt Method


The 3-opt method is very similar to the 2-opt method with the exception
that not two but three edges are replaced. Considering a route with n nodes
being involved, $\frac{n(n-1)(n-2)}{6}$ different 3-change operations are
possible [GS90].


$$(c_1 \dots c_i c_{i+1} \dots c_j c_{j+1} \dots c_k c_{k+1} \dots c_n) \longrightarrow (c_1 \dots c_i c_{j+1} \dots c_k c_{i+1} \dots c_j c_{k+1} \dots c_n)$$


FIGURE 8.3: Example of a 3-change for a TSP instance with 11 cities. 


Figure 8.3 illustrates one possible 3-change operation for a small TSP
instance. In this example (assuming that the left tour is transformed to the
right tour) the three edges 4-9, 5-10, and 8-11 are removed and the edges
4-5, 8-9, and 10-11 are inserted in order to reestablish a valid tour.

Also about half a century ago, Bock [Boc58] was the first to apply
the 3-opt method to the TSP. Similar to the 2-opt method, the final route
is derived by successively applying 3-change operations until the process
terminates in a so-called 3-optimal solution. The probability of obtaining a
global optimal solution using the 3-opt method was empirically found to be
about $2^{-n/10}$ [Lin65].


The k-opt Method 


In principle, the k-opt method is the logical generalization of the
methods described previously: k edges are replaced in a k-change
neighborhood structure and a route is called k-optimal if it cannot be improved
by any k-change. If $k = n$ then it is proven that the k-optimal solution is the
global solution [PS82]. But as the complexity of locating a k-optimal solution
is given by $O(n^k)$ [GBD80], the computational effort is still enormous even
for rather small values of k. A very efficient method of this kind for
Euclidean traveling salesman problems is the Lin-Kernighan algorithm [LK73];
an efficient implementation of it is given in [Hel00].


8.1.3 Multiple Traveling Salesman Problems


The multiple traveling salesman problem (MTSP) describes a generalization 
of the TSP in the sense that there is not just one traveling salesman performing 
the whole tour but rather a set of salesmen, each serving a subset of the cities 
involved. Therefore, one of the cities has to be selected as the location for the 
depot representing the starting as well as the end point of all routes. So the 
MTSP is a combination of the assignment problem and the TSP. Usually a 
tour denotes the set of cities served by one traveling salesman and the number 
of tours is specified by m. 

In literature there are mainly two definitions of the MTSP: 


• In Bellmore's definition (given in [BH74]) the task is to find exactly m
tours in such a way that each city in a tour and the depot are visited
exactly once with the objective of minimizing the total distance.


• The second definition of the MTSP (as given in [Ber98], e.g.) does not
postulate exactly but at most m routes, and the goal is to minimize the
total distance such that each tour includes the depot and each city is
visited exactly once in some tour.


At first sight the second definition seems more reasonable because there is no 
comprehensible reason why one should consider m tours if there is a solution 
involving only (m—1) tours, for example. Still, one has to be aware of the fact 
that in the second definition with no additional constraints the solution will 
always be a single tour including all cities for any distance matrix fulfilling 
the triangle inequality. 


8.1.4 Genetic Algorithm Approaches 


Sequencing problems as for example the TSP are among the first applica- 
tions of genetic algorithms, even if the classical binary representation as sug- 
gested in [Gol89] is not particularly suitable for the TSP because crossover 
hardly ever produces valid descendants. 

In the following we will discuss some GA coding standards for the TSP as 
proposed in the relevant GA and TSP literature: 


8.1.4.1 Problem Representations and Operators 


Adjacency Representation 


In the adjacency representation [GGRG85] a tour is represented as a list of
n cities where city j is listed in position i if and only if the tour leads from
city i to city j. Thus, the list


(3 5 7 6 4 8 2 1)


represents the tour


1-3-7-2-5-4-6-8


In the adjacency representation any tour has its unique adjacency list
representation. However, an adjacency list may also represent an illegal tour.
For example,


(3 5 7 6 2 4 1 8)

represents the following collection of cycles:

1-3-7, 2-5, 4-6, 8
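
Decoding an adjacency list into its cycles shows directly whether it encodes a
legal tour. The following Python sketch is illustrative only: the function
names are assumptions, cities are numbered from 1 as in the text, the input is
assumed to be a permutation of 1..n, and the legal list (3 5 7 6 4 8 2 1) is
used here simply as an example of a single-cycle adjacency list.

```python
def path_to_adjacency(path):
    """Convert a path (cities 1..n in visiting order) to an adjacency list."""
    n = len(path)
    adjacency = [0] * n
    for pos, city in enumerate(path):
        # Position `city` holds the city visited directly after `city`.
        adjacency[city - 1] = path[(pos + 1) % n]
    return adjacency


def adjacency_to_cycles(adjacency):
    """Decompose an adjacency list (a permutation of 1..n) into its cycles."""
    seen, cycles = set(), []
    for start in range(1, len(adjacency) + 1):
        if start in seen:
            continue
        cycle, city = [], start
        while city not in seen:
            seen.add(city)
            cycle.append(city)
            city = adjacency[city - 1]
        cycles.append(cycle)
    return cycles


def is_legal_tour(adjacency):
    """A legal tour consists of exactly one cycle through all cities."""
    return len(adjacency_to_cycles(adjacency)) == 1


legal_cycles = adjacency_to_cycles([3, 5, 7, 6, 4, 8, 2, 1])
illegal_cycles = adjacency_to_cycles([3, 5, 7, 6, 2, 4, 1, 8])
```

A repair operator for this representation would have to merge the subcycles
found by such a decomposition into a single cycle.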


Obviously, the classical crossover operator(s) (single or n-point crossover) 
are very likely to return illegal tours for the adjacency representation. There- 
fore, the use of a repair operator becomes necessary. 

Other operators for crossover have been defined and investigated for this 
kind of representation: 


• Alternating Edge Crossover:

The alternating edge crossover [GGRG85] chooses an edge from the first 
parent at random. Then, the partial tour created in this way is extended 
with the appropriate edge of the second parent. This partial tour is 
extended by the adequate edge of the first parent, etc. By doing so, 
the partial tour is extended by choosing edges from alternating parents. 
If an edge is chosen which would produce a cycle into the partial tour, 
then the edge is not added; instead, the operator randomly selects an 
edge from the edges which do not produce a cycle. 

For example, the result of an alternating edge crossover of the parents


(2 3 8 7 9 1 4 5 6) and (7 5 1 6 9 2 8 4 3)

could be

(2 5 8 7 9 1 6 4 3)


The first edge chosen is (1 - 2), included in the first parent’s genetic
material; the second edge chosen, edge (2 - 5), is selected from the second
parent, etc. The only randomly introduced edge is 7 - 6 instead of 7 - 8.
Nevertheless, experimental results using this operator have been dis- 
couraging. The obvious explanation seems to be that good subtours are 
often disrupted by the crossover operator. Ideally, an operator ought 
to promote longer and longer high performance subtours; this has mo- 
tivated the development of the following operator. 


• Subtour Chunks Crossover:
Using the subtour chunks crossover [GGRG85] an offspring is con- 
structed from two parent tours in the following way: A random subtour 
of the first parent is chosen, and this partial tour is extended by choosing 
a subtour of random length from the second parent. Then, the partial 
tour is extended by taking subtours from alternating parents. If the use 
of a subtour, which is selected from one of the parents, would lead to
an illegal tour, then it is not added; instead, an edge is added which is 
randomly chosen from the edges that do not produce a cycle. 


• Heuristic Crossover:

The heuristic crossover [GGRG85] starts by randomly selecting a city
as the starting point of the offspring’s tour. Then, the two edges
leaving this city in the parents are compared and the shorter of them
is chosen. Next, the city on the other side of the chosen edge is selected
as a reference city. The edges which start from this reference city are
compared and the shorter one is added to the partial tour, etc. If at
some stage a new edge introduces a cycle into the partial tour, then the
tour is extended with an edge chosen at random from the remaining
edges which do not introduce cycles.
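
The heuristic crossover can be sketched in a few lines. This is an illustration under our own assumptions rather than the implementation of [GGRG85]: parents are given as adjacency lists with cities numbered from 1, `dist` is a distance matrix indexed by city number (row and column 0 are unused), and the offspring is returned as a sequence of visited cities.

```python
import random


def heuristic_crossover(adj1, adj2, dist, rng=random):
    """Build an offspring tour by repeatedly taking the shorter parental edge."""
    n = len(adj1)
    current = rng.randint(1, n)                 # random starting city
    tour, visited = [current], {current}
    while len(tour) < n:
        # Compare the edges leaving the current city in the two parents.
        candidates = (adj1[current - 1], adj2[current - 1])
        nxt = min(candidates, key=lambda city: dist[current][city])
        if nxt in visited:
            # The shorter edge would close a cycle: pick a random free city instead.
            nxt = rng.choice(sorted(set(range(1, n + 1)) - visited))
        tour.append(nxt)
        visited.add(nxt)
        current = nxt
    return tour
```

Since edge lengths guide the choice, the operator uses problem knowledge that the alternating edge and subtour chunks operators ignore.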


The main advantage of the adjacency representation is that it allows
schemata analysis as described in [OSH87], [GGRG85], [Mic92]. Unfortunately,
all operators described above lead to poor results; in particular, the
experimental results with the alternating edge operator have been very poor.
This is because this operator often destroys good subtours of the parent
tours. The subtour chunk operator, which chooses subtours instead of edges
from the parent tours, performs better than the alternating edge operator.
However, it still has quite a low performance because it does not take into
account any information available about the edges. The heuristic crossover
operator performs far better than the other two operators; still, the
performance of the heuristic operator is not remarkable either [GGRG85].


Ordinal Representation 


When using the ordinal representation as described in [GGRG85] a tour is
also represented as a list of n cities; the i-th element of the list is a number
in the range from 1 to n - i + 1. In addition, an ordered list of cities is
used as a reference point.

The easiest way to explain the ordinal representation is probably by giving 
an example. Assume, for instance, that the ordered list L is given as 


L = (1 2 3 4 5 6 7).


Now the tour

1 - 2 - 7 - 5 - 6 - 3 - 4


in ordinal representation is given as

T = (1 1 5 3 3 1 1).


This can be interpreted in the following way: The first member of T is 1, which
means that in order to get the first city of the tour we take the first element of
the list L and remove it from the list. So the partial tour is 1 at the beginning.
The second element of T is also 1 so the second city of the route is 2 which is
situated at the first position of the reduced list. After removing city 2 from
the list, the next city to add is in position 5 according to T, which is city 7 in
the again reduced list L, etc. If we proceed in this way until all elements of
L are removed, we will finally find the tour 1 - 2 - 7 - 5 - 6 - 3 - 4 with the
corresponding ordinal representation T = (1 1 5 3 3 1 1).

The main advantage of this rather complicated ordinal representation lies in
the fact that the classical crossover can be used. This follows from the fact
that the i-th element of the tour representation is always a number in the
range from 1 to n - i + 1. Partial tours to the left of the crossover point
do not change, whereas partial tours to the right of the crossover point are
split in a quite random way; therefore, the results obtained using ordinal
representation have been generally poor (comparable to the results with
adjacency representation) [GGRG85], [LKM+99].
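
The conversion between a tour and its ordinal representation can be sketched as follows; the function names are chosen here for illustration only.

```python
def to_ordinal(tour, reference):
    """Encode a tour relative to an ordered reference list of cities."""
    remaining = list(reference)
    ordinal = []
    for city in tour:
        index = remaining.index(city)
        ordinal.append(index + 1)      # positions are counted from 1
        remaining.pop(index)           # the reference list shrinks by one city
    return ordinal


def from_ordinal(ordinal, reference):
    """Decode an ordinal list back into the tour it represents."""
    remaining = list(reference)
    return [remaining.pop(position - 1) for position in ordinal]
```

With L = (1 2 3 4 5 6 7), `to_ordinal([1, 2, 7, 5, 6, 3, 4], range(1, 8))` gives (1 1 5 3 3 1 1), and `from_ordinal` reverses the step. Cutting two ordinal lists at the same point and joining the head of one to the tail of the other always decodes to a legal tour, which is why the classical crossover is applicable.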


Path Representation 


The path representation is probably the most natural representation of a
tour. Again, a tour is represented as a list of n cities. If city i is the j-th
element of the list, city i is the j-th city to be visited. Hence the tour


1 - 2 - 7 - 5 - 6 - 3 - 4


is simply represented by 
(1 2 7 5 6 3 4). 


Since the classical operators are not suitable for the TSP in combination with 
the path representation, other crossover and mutation operators have been 
defined and discussed. As this kind of representation will be used for our 
experiments in Chapter 10, we shall now discuss the corresponding operators 
in a more detailed manner: 


• Partially Matched Crossover (PMX):
The partially matched crossover operator has been proposed by Goldberg
and Lingle in [GL85]. It passes on ordering and value information
from the parent tours to the offspring tours: A part of one parent’s string
is mapped onto a part of the other parent’s string and the remaining
information is exchanged.


Let us for example consider the following two parent tours
(a b c d e f g h i j) and (c f g a j b d i e h).


FIGURE 8.4: Example for a partially matched crossover (adapted from
[Wen95]).


The PMX operator creates an offspring in the following way: First, it
randomly selects two cutting points along the strings. As indicated in
Figure 8.4, suppose that the first cut point is selected between the fifth
and the sixth element and the second one between the eighth and ninth
string element. The substrings between the cutting points are called
the mapping sections. In our example they define the mappings f ↔ b,
g ↔ d, and h ↔ i. Now the mapping section of the first parent is copied
into the offspring, resulting in


(_ _ _ _ _ f g h _ _)


Then the offspring is filled up by copying the elements of the second 
parent; if a city is already present in the offspring then it is replaced 
according to the mappings. Hence, as illustrated in Figure 8.4, the 
resulting offspring is given by 


(c b d a j f g h e i)


The PMX operator therefore tries to keep the positions of the cities in
the path representation; these are rather irrelevant in the context of the
TSP where the most important goal is to keep the sequences.
Thus, the performance of this operator for the TSP is rather poor,
but we can easily imagine that this operator could perform well for
other combinatorial optimization problems like the machine scheduling
problem even if it has not been developed for such problem instances.


• Order Crossover (OX):
The order crossover operator has been introduced by Davis in [Dav85].
It was the first operator to exploit the essential property of the path
representation, namely that the order of cities is important and not their
position.


FIGURE 8.5: Example for an order crossover (adapted from [Wen95]).


It constructs an offspring by choosing a subtour of one parent and
preserving the relative order of the cities of the other parent. For example
let us consider the following two parent tours
(a b c d e f g h i j) and (c f g a h b d i e j)


and suppose that we select the first cut point between the fifth and sixth 
position and the second cut point between the eighth and ninth position. 
For creating the offspring the tour segment between the cut points of 
the first parent is copied into it, which gives 


(_ _ _ _ _ f g h _ _)


Then the selected cities of the first parent’s tour segment are canceled 
from the list of the second parent and the blank positions of the child 
are filled with the elements of the shortened list in the given order (as 
illustrated in Figure 8.5), which gives 


(c a b d i f g h e j)


Since a much higher number of edges is maintained, the results are 
unequivocally much better compared to the results achieved using the 
PMX operator. 


• Cyclic Crossover (CX):
The cyclic crossover operator, proposed by Oliver et al. in [OSH87],
attempts to create an offspring from the parents where every position is
occupied by a corresponding element from one of the parents.


FIGURE 8.6: Example for a cyclic crossover (adapted from [Wen95]).


For example, again consider the parents
(a b c d e f g h i j) and (c f g a h b d i e j)


and choose the first element of the first parent tour as the first element
of the offspring. As node c can no longer be transferred to the child
from the second parent, we visit node c in the first parent and transfer
it to the offspring, which makes it impossible for the second parent’s node
g to occupy the same position in the child. Therefore, g is taken from
parent 1 and so on. This process is continued as long as possible, i.e.,
as long as the selected node is not yet a member of the offspring. In
our example this is the case after four successful copies resulting in the
following partial tour:


(a _ c d _ _ g _ _ _).


The remaining positions can then simply be taken from one of the two
parents; in this example, which is graphically illustrated in Figure 8.6,
these are taken from parent 2.

Oliver et al. [OSH87] concluded from theoretical and empirical results 
that the CX operator gave slightly better results than the PMX opera- 
tor. Anyway, the results of both position preserving operators CX and 
PMX are definitely worse than those obtained with OX which fortifies 
our basic assumption that in the context of the TSP it is much more 
important to keep sequences rather than positions. 


• Edge Recombination Crossover (ERX):
Even if the main aim of the OX operator is to keep the sequence of at
least one parent there are still quite a lot of new edges in the offspring.¹
Whitley et al. [WSF89] tried to overcome this drawback and came up
with the edge recombination crossover which has been designed with the
objective of keeping as many edges defined by the parents as possible.
Indeed it can be shown that about 95%-99% of each child’s edges occur
in at least one of the two respective parents [WSF89]. Therefore, the
ERX operator for the first time represented an almost mutation-free
crossover operator, but unfortunately this can only be achieved by a
quite complicated and time consuming procedure:


¹In the present context “new” means that those edges do not occur in either of
the two parents.

The ERX operator is an operator which is suitable for the symmetrical 
TSP as it assumes that only the values of the edges are important and 
not their direction. Pursuant to this assumption, the edges of a tour 
can be seen as the carriers of heritable information. Thus, the ERX op- 
erator attempts to preserve the edges of the parents in order to pass on 
a maximum amount of information to the offspring whereby the break- 
ing of edges is considered as an unwanted mutation. The problem that 
usually occurs with operators that follow an edge recombination strat- 
egy is that they often leave cities without a continuing edge [GGRG85] 
whereby these cities become isolated and new edges have to be intro- 
duced. 

The ERX operator tries to avoid this problem by first choosing cities
that have few unused edges; still, there has to be a connection with a city
before it can be selected. The only edge that the ERX operator may fail
to enforce is the edge from the final city to the initial city, which prevents
the ERX operator from working totally mutation free. When constructing
an offspring (descendant), we first have to construct a so-called “edge
map” which lists, for each city, the edges of the parents that start or
finish in it. Then, the ERX works according to the following algorithm
[WSF89]:


1. Choose the initial city from one of the two parent tours. It might 
be chosen randomly or according to criteria outlined in step 4. This 
is the “current city.” 


2. Remove all occurrences of the “current city” from the left hand 
side of the edge map. 


3. If the current city has entries in its edge list go to step 4; otherwise,
go to step 5.


4. Determine which of the cities in the edge list of the current city has
the fewest entries in its own edge list. The city with the fewest
entries becomes the “current city”; ties are broken at random.
Proceed with step 2. 


5. If there are no remaining unvisited cities, then terminate; otherwise 
randomly choose an unvisited city and continue with step 2. 


We will explain the functioning of ERX on the basis of a small example 
which has also been used in [Wen95]. Consider for instance the tours 


(1 2 3 4 5 6 7 8 9) and (4 1 2 8 7 6 9 3 5)


The edge map for our example parent tours is given in Table 8.1. 
Table 8.1: Exemplary edge map of the parent tours for an ERX operator.


city | connected cities
1    | 9, 2, 4
2    | 1, 3, 8
3    | 2, 4, 9, 5
4    | 3, 5, 1
5    | 4, 6, 3
6    | 5, 9, 7
7    | 6, 8
8    | 7, 9, 2
9    | 8, 1, 6, 3


According to the procedure given before, we select city 1 as the initial
city. The edge map of city 1 shows that cities 9, 2, and 4 are the
candidates for becoming the next current city. As city 9 actually has
4 (8, 1, 6, 3) further links, we have to decide between the cities 2 and 4
which both have 3 further links. Choosing city 4 as the next current
city we obtain 3 and 5 as the next candidates, etc. Proceeding in that
way, we might finally end up with the offspring tour


(1 4 5 6 7 8 2 3 9)


which for this special case of our example is totally mutation free, i.e., 
all edges of the offspring occur in at least one of the two parents. 


As common sequences of the parent tours are not taken into account by
the ERX operator, an enhancement, commonly denoted as “enhanced
edge recombination crossover (EERX)”, has been developed [SMM+91].
The EERX additionally gives priority to those edges starting from the
current city which are present in both parents.
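
The four path-representation crossovers described above can be sketched compactly. The code is an illustration under our own assumptions, not a reference implementation: tours are Python lists, cut points are given as slice indices `a` and `b`, `ox` fills the blank positions from left to right as in the example in the text, and `erx` starts from the first city of one parent unless `start` is given.

```python
import random


def pmx(p1, p2, a, b):
    """Partially matched crossover; p1[a:b] is the mapping section."""
    mapping = dict(zip(p1[a:b], p2[a:b]))        # e.g. f -> b, g -> d, h -> i
    child = list(p2)
    child[a:b] = p1[a:b]
    for i in list(range(a)) + list(range(b, len(p1))):
        city = p2[i]
        while city in mapping:                   # city already sits in the copied section
            city = mapping[city]
        child[i] = city
    return child


def ox(p1, p2, a, b):
    """Order crossover: keep p1[a:b], fill the blanks in the order of p2."""
    segment = p1[a:b]
    rest = [city for city in p2 if city not in segment]
    return rest[:a] + segment + rest[a:]


def cx(p1, p2):
    """Cyclic crossover: copy one cycle from p1, take the rest from p2."""
    position = {city: i for i, city in enumerate(p1)}
    child = [None] * len(p1)
    i = 0
    while child[i] is None:
        child[i] = p1[i]
        i = position[p2[i]]                      # where p1 holds the city p2 has here
    return [p2[k] if city is None else city for k, city in enumerate(child)]


def edge_map(t1, t2):
    """For every city, the set of its neighbors in either parent tour."""
    edges = {city: set() for city in t1}
    for tour in (t1, t2):
        for i, city in enumerate(tour):
            edges[city].update((tour[i - 1], tour[(i + 1) % len(tour)]))
    return edges


def erx(t1, t2, start=None, rng=random):
    """Edge recombination following steps 1 to 5 of the algorithm in the text."""
    edges = edge_map(t1, t2)
    current = start if start is not None else rng.choice((t1[0], t2[0]))  # step 1
    child = [current]
    while len(child) < len(t1):
        for neighbors in edges.values():         # step 2
            neighbors.discard(current)
        candidates = edges[current]
        if candidates:                           # steps 3 and 4
            fewest = min(len(edges[c]) for c in candidates)
            current = rng.choice(sorted(c for c in candidates if len(edges[c]) == fewest))
        else:                                    # step 5
            current = rng.choice(sorted(set(t1) - set(child)))
        child.append(current)
    return child
```

Applied to the example parents, `pmx` with cut points 5 and 8 gives (c b d a j f g h e i), `ox` gives (c a b d i f g h e j), and `cx` gives (a f c d h b g i e j); `edge_map` reproduces the entries of Table 8.1. The result of `erx` depends on how ties are broken and is therefore not fixed.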


For mutation in the context of applying genetic algorithms to the TSP, the
2-change and 3-change techniques have turned out to be very successful
[WSF89]. A comprehensive review of mutation operators for the TSP is given
in [LKM+99]. In the following some of the most important mutation operators
are described which are also applied in the experimental part of this book:


• Exchange Mutation:
The exchange mutation operator selects two cities of the tour randomly
and simply exchanges them. In various publications the exchange
mutation operator is also referred to as swap mutation, point mutation,
reciprocal exchange, or order-based mutation [LKM+99].


• Insertion Mutation:
The insertion mutation operator [Mic92] randomly chooses a city,
removes it from the tour and inserts it at a randomly selected place. An
alternative naming for insertion mutation is position-based mutation
[LKM+99].


• Simple Inversion Mutation:
The simple inversion mutation operator [Hol75], which is used in the 
TSP experiments of the book, randomly selects two cut points and sim- 
ply reverses the string between them. 


• Inversion Mutation:
The inversion mutation operator [Fog93] randomly selects a subtour,
removes it, and inserts it in reverse order at a randomly chosen position.
An alternative naming for inversion mutation is cut-inversion mutation
[LKM+99].
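
The four mutation operators above translate almost directly into code for the path representation. The following sketch is only meant as an illustration; the function names and the optional `rng` parameter are our own choices.

```python
import random


def exchange_mutation(tour, rng=random):
    """Swap two randomly selected cities."""
    t = list(tour)
    i, j = rng.sample(range(len(t)), 2)
    t[i], t[j] = t[j], t[i]
    return t


def insertion_mutation(tour, rng=random):
    """Remove a random city and reinsert it at a random place."""
    t = list(tour)
    city = t.pop(rng.randrange(len(t)))
    t.insert(rng.randrange(len(t) + 1), city)
    return t


def simple_inversion_mutation(tour, rng=random):
    """Reverse the string between two random cut points."""
    t = list(tour)
    i, j = sorted(rng.sample(range(len(t) + 1), 2))
    t[i:j] = reversed(t[i:j])
    return t


def inversion_mutation(tour, rng=random):
    """Cut out a random subtour and reinsert it reversed at a random position."""
    t = list(tour)
    i, j = sorted(rng.sample(range(len(t) + 1), 2))
    subtour = t[i:j]
    del t[i:j]
    k = rng.randrange(len(t) + 1)
    return t[:k] + subtour[::-1] + t[k:]
```

All four operators return a new list and leave the parent tour unchanged, so that each result is again a permutation of the same cities.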


8.2 The Capacitated Vehicle Routing Problem 


In principle, the vehicle routing problem (VRP) is an m-TSP where a demand
is associated with each city, and the salesmen are interpreted as vehicles each
having the same capacity. A survey of the VRP is for example given in [Gol84].
In later years a number of authors have “renamed” this problem the
capacitated vehicle routing problem (CVRP). The sum of demands on a route
cannot exceed the capacity of the vehicle assigned to this route; as in the
m-TSP we want to minimize the sum of distances of the routes. Note that the
CVRP is not purely geographic since the demand may be constraining.

The CVRP is the basic model for a number of vehicle routing problems: 

If a time slot, in which customers have to be visited, is added to each 
customer, then we get the “vehicle routing problem with time windows” 
(VRPTW or CVRPTW). In addition to the capacity constraint, a vehicle 
now has to visit a customer within a certain time frame given by a ready 
time and due date. It is generally allowed that a vehicle may arrive before 
the ready time (in this case it simply waits at the customer’s place), but it is 
forbidden to arrive after the due date. However, some models allow early or 
late servicing but with some form of additional cost or penalty. These models 
are denoted “soft” time window models (as for example in [Bal93]).

If customers are served from several depots, then the CVRP becomes the
“multiple depots vehicle routing problem” (MDVRP); in this variant each
vehicle starts from and returns to the same depot. The problem can be solved
by splitting it into several single depot VRPs if such a split can be done
effectively. Another variant of the CVRP is the “vehicle routing problem with
length constraints” (VRPLC or CVRPLC). Here each route is not allowed to
exceed a given distance; this variant is also known as the “distance constrained
vehicle routing problem” (DVRP) in case there are no capacity restrictions
and the length or cost is the only limiting constraint.

In the “split delivery” model the demand of a customer is not necessarily 
covered by just one vehicle but may be split between two or more. The 
solutions obtained in a split delivery model will always be at least as good 
as for the “normal” CVRP and we often might be able to utilize the vehicles 
better and thereby save vehicles. 

Finally we shall also mention the “pickup and delivery” variant where the 
vehicles not only deliver items but also pick up items during the routes. This 
problem can be varied even more according to whether the deliveries must be 
completed before starting to pick up items or the two phases can be inter- 
leaved. 


All of these problems have in common that they are “hard” to solve. For 
the VRPTW exact solutions can be found within reasonable time for some 
instances including up to about 100 customers. A review of exact methods 
for the VRPTW is given in Subsection 8.2.1.2. 

Often the number of customers combined with the complexity of real-life 
data does not permit solving the problems exactly. In these situations it 
is advisable to apply approximation algorithms or heuristics. Both can
produce feasible, but not necessarily optimal solutions; whereas a worst-case 
deviation is known for approximation algorithms, nothing is known a priori 
for heuristics. Some of these inexact methods will be reviewed in Subsection 
8.2.1.3. 

If the term “vehicle” is interpreted more loosely, numerous scheduling prob- 
lems can also be modeled as CVRPs or VRPTWs. An example is the following 
one: For a single machine we want to schedule a number of jobs for which we 
know the flow time and the time to go from one running job to the next one. 
This scheduling problem can be regarded as a VRPTW with a single depot, a 
single vehicle, and the customers representing the jobs. The cost of changing 
from one job to another is equal to the distance between the two customers, 
and the time it takes to perform an action is the service time of the job. For a 
general description of the connection between routing and scheduling see for 
instance [vB95] or [CL98]. 


8.2.1 Problem Statement and Solution Methodology 
8.2.1.1 Definition of the CVRP 


In this section we present a mathematical formulation of the general
vehicle routing problem with time windows (VRPTW or CVRPTW) as the
(capacitated) vehicle routing problem ((C)VRP) is fully included within this
definition under special parameter settings. The formulation is based upon
the model defined by Solomon [SD88].


In this description the VRPTW is given by a fleet of homogeneous vehicles
V, a set of customers C, and a directed graph G. The graph consists of
|C| + 2 vertices, whereby the customers are denoted as 1, 2, ..., n and the
depot is represented by the vertices 0 (the “driving-out depot”) and n + 1
(the “returning depot”). The set of vertices 0, 1, ..., n + 1 is denoted as N;
the set of arcs A represents connections between customers and between the
depot and customers, where no arc terminates in vertex 0 and no arc originates
from vertex n + 1. With each arc (i, j), where i ≠ j, we associate a cost c_ij
and a time t_ij, which may include service time at customer i.

Each vehicle has a capacity q and each customer i a demand d_i. Furthermore,
each customer i has a time window [a_i, b_i]; a vehicle can arrive before
a_i, but service does not start before a_i; however, the vehicle must arrive at
the customer before b_i. In the general description, the depot also has a time
window [a_0, b_0] = [a_{n+1}, b_{n+1}], the scheduling horizon. Vehicles may
not leave the depot before a_0 and must be back before or at time b_{n+1}.

It is postulated that q, a_i, b_i, d_i, and c_ij are nonnegative integers, while
the t_ij values are assumed to be positive integers. Furthermore, it is assumed
that the triangular inequality is satisfied for both the c_ij values and the
t_ij values.

This model contains two sets of decision variables, namely x and s. For
each arc (i, j), where i ≠ j, i ≠ n + 1, j ≠ 0, and each vehicle k we define
x_ijk in the following way:


$$
x_{ijk} =
\begin{cases}
0, & \text{if vehicle } k \text{ does not drive from vertex } i \text{ to vertex } j \\
1, & \text{if vehicle } k \text{ drives from vertex } i \text{ to vertex } j
\end{cases}
$$


The decision variable s_ik is defined for each vertex i and each vehicle k,
denoting the time vehicle k starts to service customer i. If the given vehicle
k doesn’t service customer i, then s_ik does not mean anything. We assume
a_0 = 0 and therefore s_0k = 0 for all k.


The goal is to design a set of routes with minimal cost, one for each vehicle, 
such that 


• each customer is serviced exactly once,
• every route originates at vertex 0 and ends at vertex n + 1, and
• the time windows and capacity constraints are complied with.


The mathematical formulation for the VRPTW is stated as follows [Tha95]: 




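\begin{align}
\min \; & \sum_{k \in V} \sum_{i \in N} \sum_{j \in N} c_{ij} x_{ijk} \quad \text{s.t.} \tag{8.9} \\
& \sum_{k \in V} \sum_{j \in N} x_{ijk} = 1 \quad \forall i \in C \tag{8.10} \\
& \sum_{i \in C} d_i \sum_{j \in N} x_{ijk} \le q \quad \forall k \in V \tag{8.11} \\
& \sum_{j \in N} x_{0jk} = 1 \quad \forall k \in V \tag{8.12} \\
& \sum_{i \in N} x_{ihk} - \sum_{j \in N} x_{hjk} = 0 \quad \forall h \in C, \forall k \in V \tag{8.13} \\
& \sum_{i \in N} x_{i,n+1,k} = 1 \quad \forall k \in V \tag{8.14} \\
& s_{ik} + t_{ij} - K(1 - x_{ijk}) \le s_{jk} \quad \forall i, j \in N, \forall k \in V \tag{8.15} \\
& a_i \le s_{ik} \le b_i \quad \forall i \in N, \forall k \in V \tag{8.16} \\
& x_{ijk} \in \{0, 1\} \quad \forall i, j \in N, \forall k \in V \tag{8.17}
\end{align}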


The constraint (8.10) states that each customer is visited exactly once, and
(8.11) implies that no vehicle is loaded with more than its capacity allows.
Equations (8.12), (8.13), and (8.14) ensure that each vehicle leaves depot 0,
leaves again after arriving at a customer, and finally arrives at the depot
n + 1. Inequality (8.15) states that a vehicle k cannot arrive at j before
s_ik + t_ij if it is traveling from i to j, whereby K is a large scalar. Finally,
constraints (8.16) ensure that the time windows are adhered to and (8.17) are
the integrality constraints. In this definition an unused vehicle is modeled by
driving the empty route (0, n + 1).

As already mentioned earlier, the VRPTW is a generalization of TSP and
CVRP; in case the time constraints (8.15) and (8.16) are not binding, the
problem becomes a CVRP. This can be achieved by setting a_i = 0 and b_i = M
(where M is a large scalar) for all customers i. In this context it should
be noted that the time variables enable us to formulate the CVRP without
subtour elimination constraints. If only one vehicle is available, then the
problem is in fact a TSP.
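
For a given set of routes the constraints of the model can be checked directly. The sketch below is an illustration under our own assumptions: a route is a list of vertices starting at 0 and ending at n + 1, `c` and `t` are matrices of costs and times, `a`, `b`, and `d` are lists indexed by vertex, and service starts as early as the time windows allow. The function names are hypothetical.

```python
def evaluate_route(route, c, t, a, b, d, q):
    """Return (cost, feasible) for one route 0, ..., n + 1."""
    feasible = sum(d[i] for i in route[1:-1]) <= q      # capacity, cf. (8.11)
    cost = 0
    start = a[route[0]]                                 # s_0k = a_0
    for i, j in zip(route, route[1:]):
        cost += c[i][j]
        start = max(a[j], start + t[i][j])              # wait if the vehicle is early
        if start > b[j]:                                # time window missed, cf. (8.16)
            feasible = False
    return cost, feasible


def evaluate_solution(routes, n, c, t, a, b, d, q):
    """Return (total cost, feasible) for a list of routes, one per vehicle."""
    visited = sorted(v for route in routes for v in route[1:-1])
    feasible = visited == list(range(1, n + 1))         # each customer once, cf. (8.10)
    total = 0
    for route in routes:
        cost, ok = evaluate_route(route, c, t, a, b, d, q)
        total += cost
        feasible = feasible and ok and route[0] == 0 and route[-1] == n + 1
    return total, feasible
```

Setting a_i = 0 and b_i to a large value for all customers switches the time window test off, which corresponds to the reduction to the CVRP described above.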


8.2.1.2 Exact Algorithms 


Almost all papers proposing an exact algorithm for solving the CVRP or 
the VRPTW use one or a combination of the following three principles: 


• Dynamic programming
• Lagrange relaxation-based methods
• Column generation
The dynamic programming approach for the VRPTW was presented in
[KRKT87]. This paper is inspired by an earlier publication [CMT81] where 
Christofides et al. used the dynamic programming paradigm to solve the 
CVRP. 

Lagrange relaxation-based methods have been published in a number of 
papers using slightly different approaches. There are approaches applying 
variable splitting followed by Lagrange relaxation as well as variants apply- 
ing the k-tree approach followed by Lagrange relaxation. In [FJM97] Fisher 
et al. presented a shortest path approach with side constraints followed by 
Lagrangean relaxation. The main problem, which consists of finding the op- 
timal Lagrange multipliers that yield the best lower bounds, is solved by a 
method using both subgradient optimization and a bundle method. Kohl et 
al. [KM97] managed to solve problems of 100 customers from the Solomon 
test cases; among them some previously unsolved problems. 

If a linear program contains too many variables to be solved explicitly, it 
is possible to initialize the linear program with a smaller subset of variables 
and compute a solution of this reduced linear program. Afterwards one has to 
check whether or not the addition of one or more variables, currently not in 
the linear program, might improve the solution; this check is commonly done 
by the computation of the reduced costs of the variables. An introduction to 
this method (commonly called “column generation method”) can for example
be found in [BJN+98].


Again, as for the TSP, it takes experienced users to benefit from the
mentioned exact algorithms, especially if they are applied to large
problems.

Therefore, the main area of application of exact methods in the context of 
CVRP is to locate the exact solution of some large benchmark problems in 
order to get some reference-problems for testing certain heuristics which can 
easily be applied to practical problems of higher dimension. 


8.2.1.3 Approximation Algorithms and Heuristics 


The field of inexact algorithms for the CVRP has been very active - far 
more active than that of exact algorithms, and a long series of papers has 
been published over the recent years. Heuristic algorithms that build a set 
of routes from scratch are typically called route-building heuristics, while an 
algorithm that tries to produce an improved solution on the basis of an already 
available solution is denoted as route-improving. 


The Savings Heuristic 


The savings heuristic has been introduced in [CW64]. At the beginning of
the algorithm, each of the n customers (cities) is considered to be served
by its own vehicle. For every pair of cities a so-called savings value is
calculated; this value specifies the reduction of costs which is achieved when
the two routes are combined. Then the routes are merged in descending order
of their savings values if all constraints are satisfied. According to [Lap92]
the time complexity of the savings heuristic is given as O(n^2 log n).
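
A minimal sketch of the savings idea is given below. It rests on assumptions that go beyond the text: the distance matrix `dist` is symmetric with the depot at index 0, vehicle capacity is the only constraint, the savings value of cities i and j is taken as dist[0][i] + dist[0][j] - dist[i][j], and two routes are only joined at cities that are directly connected to the depot. The function name is hypothetical.

```python
def savings_routes(dist, demand, capacity):
    """Merge single-customer routes in descending order of their savings."""
    n = len(dist) - 1
    routes = {i: [i] for i in range(1, n + 1)}       # route id -> customer sequence
    route_of = {i: i for i in range(1, n + 1)}       # customer -> route id
    load = {i: demand[i] for i in range(1, n + 1)}
    savings = sorted(
        ((dist[0][i] + dist[0][j] - dist[i][j], i, j)
         for i in range(1, n + 1) for j in range(i + 1, n + 1)),
        reverse=True)                                # sorting dominates the run time
    for saving, i, j in savings:
        if saving <= 0:
            break                                    # no cost reduction left
        ri, rj = route_of[i], route_of[j]
        if ri == rj or load[ri] + load[rj] > capacity:
            continue
        first, second = routes[ri], routes[rj]
        if i not in (first[0], first[-1]) or j not in (second[0], second[-1]):
            continue                                 # i or j lies inside its route
        if first[-1] != i:
            first.reverse()                          # make i the last city of its route
        if second[0] != j:
            second.reverse()                         # make j the first city of its route
        first.extend(second)
        for city in second:
            route_of[city] = ri
        load[ri] += load.pop(rj)
        del routes[rj]
    return list(routes.values())
```

Each returned list is the customer sequence of one route; the depot is implied at both ends.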


Many papers based on the savings heuristic have been published. Gaskell’s
approach [Gas67] is especially noteworthy in this context as it introduces a
different weighting of the savings with respect to the length of the newly
inserted route part as well as the so-called parallel savings algorithm that
examines not only a pair but rather an n-tuple of routes.


The Sweep Heuristic 


The sweep heuristic has been introduced by Gillett and Miller [GM74]. It 
belongs to the so-called “successive methods” in the sense that the ultimate 
goal of this approach is not necessarily the location of the final solution but 
rather the generation of reasonable valid tours which can be optimized by 
some kind of route improving heuristic. 


The fundamental idea of the sweep heuristic can be described as follows:
Imagining a clock hand that is mounted at the depot, the sweep heuristic
builds up the first tour starting from an arbitrary angle and takes the cities
in the order the clock hand sweeps over them as long as all constraints are
fulfilled. Then the next cluster is established in the same way.


FIGURE 8.7: Exemplary result of the sweep heuristic for a small CVRP. 
Figure 8.7 shows a possible solution achieved by a simple sweep heuristic.
The more customers can be assigned to a route, the better the sweep heuristic
typically performs. The time complexity of sweep heuristics is O(n log n),
which is equal to the complexity of a sorting algorithm.
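
The clustering step of the sweep heuristic can be sketched as follows. The sketch assumes planar coordinates with the depot stored at index 0, a counterclockwise sweep, and vehicle capacity as the only constraint; the function name and the `start_angle` parameter are illustrative.

```python
import math


def sweep(coords, demand, capacity, start_angle=0.0):
    """Group customers into clusters in the order a rotating hand passes them."""
    x0, y0 = coords[0]

    def angle(i):
        x, y = coords[i]
        return (math.atan2(y - y0, x - x0) - start_angle) % (2 * math.pi)

    order = sorted(range(1, len(coords)), key=angle)   # the O(n log n) sorting step
    clusters, current, load = [], [], 0
    for i in order:
        if current and load + demand[i] > capacity:
            clusters.append(current)                   # vehicle is full: close the cluster
            current, load = [], 0
        current.append(i)
        load += demand[i]
    if current:
        clusters.append(current)
    return clusters
```

The clusters only fix which customers share a vehicle; the visiting order inside each cluster can afterwards be improved by a route-improving heuristic, as noted above.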


The Push Forward Insertion Heuristic 


In [Sol87] Solomon describes and evaluates three insertion heuristics for
the VRPTW. Here a new route is started by a customer which minimizes
a certain cost function cf_1. All unrouted customers are then evaluated for
insertion between any two customers i and j of the partial route according to
another cost function cf_2. If no feasible insertion is possible for any
customer, cf_1 is evaluated again for all unrouted customers in order to
determine the starting customer of a new route. Three different possible
criteria for selecting the next customer are the following ones:


• Farthest customer from the depot first
• Customer with the earliest due date first
• Customer with the minimum equally weighted direct route-time and
distance first


The third function basically describes the closest customer that will be
directly reached in time. During the evaluation of their performance Solomon
states that generally neither of the first two criteria is better than the
other. The farthest customer first criterion is suited for problems with
shorter scheduling horizons, while selecting the customers regarding the
earliest due date gives better results in situations where the scheduling
horizons are longer, i.e., where vehicles have to visit more customers. The
third alternative for cf_1 was not examined closer as it was not used in
conjunction with the best performing alternative for cost function cf_2.

As Solomon notes, the three insertion heuristics that are described as
alternatives for $cf_2$ are guided by both geographical and temporal criteria.
The insertion heuristic which is described first, termed I1, performed best in
a number of test cases. Basically it extends the savings heuristic insofar as
it takes into account the prolongation of the arrival time at the next
customer. This function evaluates the difference between scheduling a customer
directly and servicing it in an existing route between two customers.
Mathematically it can be described as


$$I1(i, u, j) = \lambda t_{0u} - \left( \alpha_1 (t_{iu} + t_{uj} - \mu t_{ij}) + \alpha_2 (b_{j_u} - b_j) \right) \qquad (8.18)$$


with the following restrictions: $\lambda, \mu, \alpha_1, \alpha_2 \geq 0$ and
$\alpha_1 + \alpha_2 = 1$. In this case the VRP cost function $c_{ij}$ equals 1
for each pair of different customers $i$ and $j$.

Tests with a choice of some configurations for $\lambda$, $\mu$, $\alpha_1$,
and $\alpha_2$ showed that good results were achieved with $\lambda = 2$,
$\mu = 1$, $\alpha_1 = 0$, and $\alpha_2 = 1$, thus using only time-based
savings instead of distance-based savings.
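The following Python sketch shows how such an insertion criterion could be evaluated for one candidate position. It is only an illustration: the function name, the argument names, and the convention that node 0 is the depot are assumptions of this example, and the begin-of-service times at customer j before and after the insertion are expected to be supplied by the caller.

```python
def i1_insertion_value(t, i, u, j, begin_j_old, begin_j_new,
                       lam=2.0, mu=1.0, alpha1=0.0, alpha2=1.0):
    """Evaluate the insertion of unrouted customer u between i and j.

    t[a][b] is the travel time between nodes a and b (node 0 is the depot).
    begin_j_old and begin_j_new are the begin-of-service times at j before
    and after the insertion. The default weights are the configuration
    reported in the text (lambda = 2, mu = 1, alpha1 = 0, alpha2 = 1).
    """
    if min(lam, mu, alpha1, alpha2) < 0 or abs(alpha1 + alpha2 - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and alpha1 + alpha2 = 1")
    detour = t[i][u] + t[u][j] - mu * t[i][j]   # distance-based term
    delay = begin_j_new - begin_j_old           # time-based term
    return lam * t[0][u] - (alpha1 * detour + alpha2 * delay)
```

With the default weights the distance-based term drops out completely, which corresponds to the purely time-based savings mentioned above.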

The name “push forward insertion heuristic” stems from a more efficient
computation of the feasibility of an insertion. At each point where a customer
could be inserted, the amount of time by which the vehicle would arrive later
at the preceding customer (the push forward) is propagated sequentially
through the route. As soon as this value becomes 0 the insertion is feasible,
as the remaining customers would not be serviced later than they already are.
If the old partial route is feasible, then the new one will thus also be
feasible. If the push forward moves the arrival past the due date at a
customer, then an infeasible insertion is encountered and the rest of the
route does not have to be checked. In the worst case this method still needs
to perform the calculation for every customer in the tour. Feasibility
regarding the capacity constraints, at least for the VRP variants without
pickup & delivery, is easier to compute.
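A minimal Python sketch of such a feasibility check is given below. It assumes that the current route is feasible, that the begin-of-service times of the current route are known, and that waiting at a customer is allowed until its ready time; all names (begin, ready, due, service, travel) are chosen for this example only.

```python
def insertion_is_feasible(begin, ready, due, service, travel, route, pos, u):
    """Check the time windows when inserting customer u at index pos.

    route is a list of node ids starting with the depot; begin[k] is the
    current begin-of-service time at route[k]. ready, due and service are
    indexed by node id, travel[a][b] is the travel time from a to b.
    The current route is assumed to be feasible.
    """
    i = route[pos - 1]
    begin_u = max(ready[u], begin[pos - 1] + service[i] + travel[i][u])
    if begin_u > due[u]:
        return False

    prev, begin_prev = u, begin_u
    for k in range(pos, len(route)):
        node = route[k]
        begin_new = max(ready[node],
                        begin_prev + service[prev] + travel[prev][node])
        push_forward = begin_new - begin[k]
        if push_forward <= 0:
            # Waiting time absorbed the delay: later customers are unaffected.
            return True
        if begin_new > due[node]:
            # Due date violated: the rest of the route need not be checked.
            return False
        prev, begin_prev = node, begin_new
    return True
```

The early exits mirror the two stopping conditions described above; only when neither applies does the loop visit every remaining customer of the route.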

Solomon concludes that a hybridization of I1 with a sweep heuristic could
achieve excellent initial solutions with a reasonable amount of computation.
Such an approach can be found in [TPS96] where $cf_1$ is a function taking
into account three different properties: distance, due date, and the polar
angle. The mathematical description reads


$$cf_1(u) = -\alpha\, t_{0u} + \beta\, b_u + \gamma\, p_u \qquad (8.19)$$


with empirically derived weights $\alpha = 0.7$, $\beta = 0.2$, and $\gamma = 0.1$.


Other Methods 


The problem of building one route at a time (which is done when using the
heuristics described above) is usually that the routes generated in the latter
part of the process are of worse quality because the last unrouted customers
tend to be scattered over the geographic area. Potvin and Rousseau [PR93]
tried to overcome this problem of the insertion heuristic by building several
routes simultaneously, where the initialization of the routes is done by using
Solomon’s insertion heuristic:

On each route the customer farthest away from the depot is selected as a
“seed customer.” Then, the best feasible insertion place for each unserviced
customer is computed and the customer with the largest difference between
the best and the second best insertion place is inserted. Although this method
performs better than the Solomon heuristic, it is still quite far away from
the optimum. Russell elaborates further on the insertion approach in [Rus95].


Another approach building on the classical insertion heuristic is presented
in [AD95]. Defined in a very similar way to the Solomon heuristics, every
unrouted customer requests an offer and receives a price for insertion from
every route in the schedule. Then unrouted customers send a proposal to the
route with the best offer, and each route accepts the best proposal among the
customers with the fewest number of alternatives. Therefore, more customers
can be inserted in each iteration. If a certain threshold of routes is violated,
then a certain number of customers is removed and the process is started
again. The results of Antes and Derigs are comparable to those presented in
[PR93]. As a matter of principle it has turned out that building several routes
in parallel results in better solutions than building the routes one by one.

Similar to the route first schedule second principle mentioned previously, 
Solomon also suggests doing it the other way round in the “giant tour heuris- 
tic” [Sol86]. First all customers are scheduled in a giant route and then this 
route is divided into a number of routes. In the paper no computational re- 
sults are given for the heuristic. Implementations of route-building heuristics 
on parallel hardware are reported for example in [FP93] and [Lar99]. 


8.2.2 Genetic Algorithm Approaches 


Applying genetic algorithms to vehicle routing problems with or without
time constraints is a rather young field of research; therefore, even though
a lot of research work has been done, no widely accepted standard
representations or operators have yet been established. In the following we
briefly discuss some of the more popular or promising proposals:


• A genetic algorithm for the VRPTW has been presented in [TOS94].
This algorithm uses the already mentioned cluster first route second
method whereby clustering is done by a genetic algorithm while routing
is done by an insertion heuristic. The GA works by dividing each
chromosome into K divisions of N bits. The algorithm is based on dividing
the plane by using the depot as the center and assigning a polar angle
to each customer. Each of the divisions of a chromosome then represents
the offset of the seed of a sector; the seeds here are polar angles that
bound the sector and thereby determine the members of the sector.


• The genetic algorithm of Potvin and Bengio [PB96] operates on
chromosomes of feasible solutions. The selection of parent solutions is
stochastic and biased towards the best solutions. Two types of crossover,
called RBX and SBX, are used. They rarely produce valid solutions and the
results therefore have to undergo a repair phase, as the algorithm only
works with feasible solutions. The reduction of routes is often obtained
by two mutation operators, and the routes are optimized by local search
every k iterations.

• The approach described in [PTMC02] is similar, but does not use trip
delimiters in the actual representation. Instead, the routes that a
solution is composed of are saved separately as ordered sets. The number of
routes is not predefined and practically only limited by the number of
customers. The crossover described in [PTMC02] is biased insofar as it
uses distance information to decide where to insert a segment, though
an unbiased generic variant of this crossover has been defined later in
[TMPC03]. A repair method is applied (after applying the crossover)
which constructs a feasible solution by removing customers that are
visited twice. Additionally, any route that has become too long (so that the
demand or time window constraints would not be satisfied) is split
before the violating customer. This algorithm has been applied to several
benchmark instances of the CVRP and CVRPTW problem variants.


• An encoding similar to the TSP is used in [Zhu00] and [Pri04]. As
described in these publications, a GA optimizes solutions for the
problem using a path representation that does not include the depot. The
representation is thus similar to the TSP and the subtours are
identified deterministically whenever needed. [Pri04] describes a genetic
algorithm that is applied to the DVRP using a splitting procedure that
will find the optimal placement of the depots; this guarantees the
feasibility of solutions in any case. Additionally, classic operators like the
order crossover (as used for tackling the TSP) are applied without
modifications. This approach does not rely solely on unbiased operators and
is hybridized with a local search procedure, while [Zhu00] makes use
of biased crossover operators in which distance information is used for
determining the point where a crossing might occur. In this approach
an initial population, consisting of individuals created by suitable
construction heuristics as well as randomly generated individuals, is also
used. Additionally, the mutation probability is adjusted depending on
the diversity in the population, leaving a minimum probability of 6%.


• A cellular genetic algorithm has been proposed in [AD04]. It uses an
encoding with unique trip delimiters such that the customers are assigned
numbers from 0 to (|C| - 1) while the trip delimiters are numbers from
|C| to (|C| + |V| - 1). The representation of a solution thus consists of a
string of consecutive numbers and is syntactically equal to a TSP path
encoding. This allows the use of crossover operators known from the
TSP such as the ERX. For mutating solutions the authors use insertion,
swap, and inversion operators which are similar to relocate, exchange,
and 2-opt as described below. The difference is that swap and inversion
are used in an inter- as well as an intraroute way. There is also a local
search phase which is conducted after every generation; in this phase all
individuals are optimized by 2-opt and λ-interchanges. The best results
have been achieved using both methods and setting λ = 2.
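To make the encoding with unique trip delimiters more concrete, the following Python sketch decodes such a permutation into separate routes. It is not taken from [AD04]; the function name and the decision to drop empty routes (which arise from adjacent delimiters) are assumptions of this example.

```python
def decode_routes(chromosome, num_customers):
    """Split a permutation with unique trip delimiters into routes.

    Genes 0 .. num_customers - 1 denote customers; every larger gene is a
    trip delimiter that closes the current route.
    """
    routes, current = [], []
    for gene in chromosome:
        if gene >= num_customers:   # trip delimiter
            routes.append(current)
            current = []
        else:                       # customer
            current.append(gene)
    routes.append(current)
    # Adjacent delimiters produce empty routes, i.e., unused vehicles.
    return [route for route in routes if route]


print(decode_routes([0, 1, 5, 2, 3, 6, 4], num_customers=5))
# [[0, 1], [2, 3], [4]]
```

Because every gene value occurs exactly once, permutation operators from the TSP can be applied to the chromosome directly and the routes are recovered afterwards.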


8.2.2.1 Crossover Operators 


Sequence-Based Crossover (SBX) 


The sequence-based crossover (SBX) operator has been proposed for the
CVRP in [PB96], but it is also applicable to other VRP variants.
Generally, it assumes a path representation with trip delimiters, but it can
also be applied to a representation without trip delimiters, in which case
the subtours have to be calculated first.

It works by breaking a subtour in each parent and linking the first part of 
one parent to the second part of another parent. The newly created subtour 
is then copied to a new offspring individual and completed with the genetic 
information of one of the parents. This behavior is exemplarily illustrated in 
Figure 8.8. 

This operator is very likely to create ill-formed children with duplicate or 
unrouted customers; therefore the authors also propose a repair method which 
creates syntactically valid genetic representations. However, feasibility cannot 
be guaranteed in every case as it is not always possible to find a feasible 
insertion space for all unrouted customers. In such a case the offspring is 
discarded and the crossover is executed anew with a new pair of parents. It
is stated in [PB96] that when applied on the Solomon benchmark set [Sol87]
50% of the offspring are infeasible.


FIGURE 8.8: Exemplary sequence-based crossover. 


Let us for example consider the following tours in path representation with 
trip delimiters where the depot is denoted as 0 and all other values represent 
customers. 


(0 1 2 3 0 4 5 6 0) and (0 2 5 3 0 1 4 6 0)


In this case the SBX would randomly select a cut point in each of the two
solutions, for example at customer 2 in the first parent and customer 4 in the
second one. Then the first half of the route in the first solution is
concatenated with the second half of the route in the second solution yielding


(0 1 2 6 0) 


This route then replaces the route with the selected customer in the first 
solution; thus the solution becomes 


(0 1 2 6 0 4 5 6 0)


Obviously, now customer 6 is served twice, while customer 3 is not served at 
all and the repair procedure will have to correct this situation: First it will 
remove all duplicate customers in all routes except the new one which was 
just formed by the concatenation. This results in 


(0 1 2 6 0 4 5 0)


Then it will try to route all unserviced customers in that location in which the 
detour is minimal. For this example let us assume that this is after customer 
5 and the final offspring solution thus is 


(0 1 2 6 0 4 5 3 0)
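The steps of this example can be retraced with the following Python sketch. It is a simplified illustration rather than the implementation of [PB96]: routes are stored as lists without the depot, the cut points are passed in explicitly instead of being chosen randomly, and the reinsertion of unrouted customers is delegated to a caller-supplied function (here a stand-in that simply appends to the second route, as assumed in the example above). All function and parameter names are hypothetical.

```python
def sbx(parent1, parent2, cut1, cut2, insert):
    """Sequence-based crossover on solutions given as lists of routes."""
    route1 = next(r for r in parent1 if cut1 in r)
    route2 = next(r for r in parent2 if cut2 in r)
    # First part of route1 (up to cut1) joined with the rest of route2.
    head = route1[:route1.index(cut1) + 1]
    tail = [c for c in route2[route2.index(cut2) + 1:] if c not in head]
    new_route = head + tail

    child = []
    for route in parent1:
        if route is route1:
            child.append(new_route)
        else:
            # Repair step 1: drop duplicates from all routes but the new one.
            kept = [c for c in route if c not in new_route]
            if kept:
                child.append(kept)

    # Repair step 2: reinsert customers that are no longer served.
    served = {c for route in child for c in route}
    unrouted = [c for route in parent1 for c in route if c not in served]
    for customer in unrouted:
        insert(child, customer)
    return child


def append_to_second_route(routes, customer):
    # Stand-in for the minimal-detour insertion ("after customer 5").
    routes[1].append(customer)


p1 = [[1, 2, 3], [4, 5, 6]]
p2 = [[2, 5, 3], [1, 4, 6]]
child = sbx(p1, p2, cut1=2, cut2=4, insert=append_to_second_route)
print(child)  # [[1, 2, 6], [4, 5, 3]]
```

A realistic insert function would evaluate all feasible positions and signal failure if none exists, in which case the offspring would be discarded.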


Route-Based Crossover (RBX) 


The route-based crossover (RBX) operator has also been proposed in 
[PB96]. It differs from the SBX insofar as subtours are not merged, but rather 
a complete subtour of one parent is copied to the offspring individual filling 
up the rest of the chromosome with subtours from the other parent. This 
procedure is exemplarily illustrated in Figure 8.9. Again, the operator does 
not produce feasible solutions in all cases and a repair procedure is needed. 

Let us again consider the following tours in path representation with trip 
delimiters where 0 denotes the depot and all other values represent customers. 


(0 1 2 3 0 4 5 6 0) and (0 2 5 3 0 1 4 6 0)


The RBX randomly selects a complete route in solution 1, in this case for 
example the first route: 
(0 1 2 3 0) 


This route then replaces a route in the second solution and thus we get 
(0 1 2 3 0 1 4 6 0)


Obviously, now customer 1 is served twice, while customer 5 is not served at 
all. The same repair procedure as the one described for the SBX operator is 
applied here as well, resulting in the following possible final solution: 


(0 1 2 3 0 4 6 5 0)
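The route-based variant can be sketched in the same simplified way. Again, the names are hypothetical, the routes to copy and to replace are passed in explicitly, and the reinsertion of unrouted customers is left to a caller-supplied stand-in function.

```python
def rbx(parent1, parent2, route_index, replaced_index, insert):
    """Route-based crossover on solutions given as lists of routes."""
    new_route = list(parent1[route_index])  # complete route of parent 1

    child = []
    for k, route in enumerate(parent2):
        if k == replaced_index:
            child.append(new_route)
        else:
            # Remove customers that are already served by the copied route.
            kept = [c for c in route if c not in new_route]
            if kept:
                child.append(kept)

    served = {c for route in child for c in route}
    unrouted = [c for route in parent2 for c in route if c not in served]
    for customer in unrouted:
        insert(child, customer)
    return child


def append_to_second_route(routes, customer):
    # Stand-in for the minimal-detour insertion used in the example.
    routes[1].append(customer)


p1 = [[1, 2, 3], [4, 5, 6]]
p2 = [[2, 5, 3], [1, 4, 6]]
offspring = rbx(p1, p2, route_index=0, replaced_index=0,
                insert=append_to_second_route)
print(offspring)  # [[1, 2, 3], [4, 6, 5]]
```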

FIGURE 8.9: Exemplary route-based crossover.


Other Crossover Operators 


The crossover that is introduced in [PTMC02] does not necessarily
concatenate the opposite ends of two partial routes as the SBX does, but
inserts a partial route of one solution into a good location of another
solution. The fitness of such a location is determined by the distance between
the customer which would precede the new partial route and the first customer
of that partial route. Such an approach works well when solving the CVRP, but
as is noted in [TMPC03] it does not help in the CVRPTW; this is in fact not
surprising as distance alone might not be sufficient to determine the
fitness of an insert when there are additional constraints which to a certain
degree determine the shape of a route. Thus, a generic variant of the crossover
is proposed which works similarly to the RBX, except that it does not replace,
but rather appends the new route to a given solution. Removing customers
which are served twice then is the only necessary repair method. Such a
removal also has the benefit that any solution remains feasible if it has been
feasible before.

Other genetic algorithm approaches such as the one described in [Pri04] 
build on a path representation without trip delimiters as in the TSP. This 
allows the application of those crossover operators that have been mentioned 
in Section 8.1.4.1 without modifications. In [Zhu00] the PMX is compared 
to two new crossovers called “heuristic crossover” and “merge crossover” that 
take into account spatial and temporal features of the customers. However, 
these new operators achieve slightly better results only for those problem 
instances in which the customers are clustered, whereas for the other cases 
of randomly placed customers and a mix of random and clustered customers 
PMX showed better results.


8.2.2.2 Mutation Operators


Relocate


The relocate operator moves one of the customers within the solution string
from one location to another randomly chosen one. An example of this
behavior is illustrated in Figure 8.10.




FIGURE 8.10: Exemplary relocate mutation. 


Exchange 


The exchange operator selects two customers within different tours and
switches them so that each is served by the other vehicle. The relocate and
exchange operators are similar to the (1,0) and (1,1) λ-exchanges,
respectively, defined by Osman [Osm93]. An example of the exchange behavior
is shown in Figure 8.11.


FIGURE 8.11: Exemplary exchange mutation. 




2-Opt 


The 2-opt operator selects two sites within the same route and inverts the 
route between them, so that the vehicle travels in the opposite direction. An 
example of this behavior is given in Figure 8.12. 




FIGURE 8.12: Example for a 2-opt mutation for the VRP. 




2-Opt* 


The 2-opt* operator behaves like a one point crossover operator in a tour: 
It first selects two customers in two different tours and creates two new tours; 
the first tour here consists of the first half of the first tour unified with the 
second half of the second tour, and the second tour consists of the first half 
of the second tour unified with the second half of the first tour. An example 
for the behavior of this operator is illustrated in Figure 8.13. 




FIGURE 8.13: Example for a 2-opt* mutation for the VRP.


Or-Opt


The or-opt operator takes a number of consecutive customers, deletes them
and inserts them at some point of the same tour. An example of this behavior
is given in Figure 8.14.


FIGURE 8.14: Example for an or-opt mutation for the VRP. 
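The five move types described so far can be summarized in a few lines of Python. The sketch below works on a solution given as a list of routes (without the depot) and takes all positions as explicit arguments; random selection of the positions and any feasibility checks are deliberately left out, and the function signatures are assumptions of this example.

```python
def relocate(routes, src, i, dst, j):
    """Move the customer at routes[src][i] to position j of routes[dst]."""
    new = [list(r) for r in routes]
    customer = new[src].pop(i)
    new[dst].insert(j, customer)
    return new


def exchange(routes, a, i, b, j):
    """Swap the customers at routes[a][i] and routes[b][j]."""
    new = [list(r) for r in routes]
    new[a][i], new[b][j] = new[b][j], new[a][i]
    return new


def two_opt(routes, r, i, j):
    """Invert the segment between positions i and j (inclusive) of route r."""
    new = [list(x) for x in routes]
    new[r][i:j + 1] = new[r][i:j + 1][::-1]
    return new


def two_opt_star(routes, a, i, b, j):
    """Exchange the tails of routes a and b after positions i and j."""
    new = [list(x) for x in routes]
    new[a] = routes[a][:i + 1] + routes[b][j + 1:]
    new[b] = routes[b][:j + 1] + routes[a][i + 1:]
    return new


def or_opt(routes, r, i, length, j):
    """Move `length` consecutive customers of route r, starting at i,
    to position j of the remaining (shortened) route."""
    new = [list(x) for x in routes]
    segment = new[r][i:i + length]
    del new[r][i:i + length]
    new[r][j:j] = segment
    return new


solution = [[1, 2, 3, 4], [5, 6, 7]]
print(relocate(solution, 0, 1, 1, 0))      # [[1, 3, 4], [2, 5, 6, 7]]
print(exchange(solution, 0, 0, 1, 2))      # [[7, 2, 3, 4], [5, 6, 1]]
print(two_opt(solution, 0, 1, 3))          # [[1, 4, 3, 2], [5, 6, 7]]
print(two_opt_star(solution, 0, 1, 1, 0))  # [[1, 2, 6, 7], [5, 3, 4]]
print(or_opt(solution, 0, 0, 2, 1))        # [[3, 1, 2, 4], [5, 6, 7]]
```

All functions return a new solution and leave their argument untouched, which makes it easy to discard a move that turns out to be infeasible.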


One Level Exchange (M1) 


The one level exchange (M1) operator is mentioned in the context of the
GA proposed in [PB96]. It tries to eliminate routes by inserting their
customers into other routes while maintaining a feasible solution. This
operator selects small routes with higher probability, because these are in
general easier to remove. The probability is chosen such that a trip of size N
is half as likely to be chosen as one of size N/2.
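One simple way to obtain a selection bias with this property is to weight each route by the inverse of its size. The following Python sketch does exactly that; it is an assumption made for illustration and not necessarily the scheme used in [PB96], and the function names are hypothetical.

```python
import random


def route_selection_probabilities(routes):
    """Selection probabilities inversely proportional to the route size.

    With these weights a route of size N is half as likely to be chosen
    as a route of size N/2. Routes are assumed to be non-empty.
    """
    weights = [1.0 / len(route) for route in routes]
    total = sum(weights)
    return [w / total for w in weights]


def select_route_for_removal(routes, rng=random):
    """Draw the index of the route whose elimination is attempted."""
    probabilities = route_selection_probabilities(routes)
    return rng.choices(range(len(routes)), weights=probabilities, k=1)[0]
```

For two routes with 2 and 4 customers, for instance, the probabilities are 2/3 and 1/3.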


Two Level Exchange (M2) 


The two level exchange (M2) operator looks one level deeper than the M1
operator: it also tries to eliminate routes from the solution, but tries
harder for each of the customers. After selecting a trip using the same bias
towards smaller routes as the M1, an attempt is made to insert each of its
customers in place of another customer in a different route, which in turn
has to be inserted at any other place (except the originally selected trip).
If such a feasible insertion is found, then the second selected customer is
inserted and the first customer is inserted at the second customer’s original
place. This operator is more likely to find
feasible insertion places, but requires quite a lot of computational effort; in 
the worst case the runtime complexity is O(N°). 


Local Search (LSM) 


Another operator which is also proposed in [PB96] applies several or-opt
exchanges until a local optimum is reached. It first selects all possible
combinations of three consecutive customers and tries to insert them in any
other place while still maintaining the feasibility of the solution. If no
solution can be found the process is repeated with two consecutive customers
and finally with every single customer. Once a better solution has been found,
the operator starts again with three consecutive customers and continues until
no further improvement is possible. Local search methods can improve solutions
on the one hand, but on the other hand they also might reduce the diversity
in a population when applied to several individuals which could end in the
same local minimum.


Chapter 9 


Evolutionary System Identification 


9.1 Data-Based Modeling and System Identification 
9.1.1 Basics 


In general, data mining is understood as the practice of automatically 
searching large stores of data for patterns. Nowadays, incredibly large (and 
quickly growing) amounts of data are collected in commercial, administrative, 
and scientific databases. Several sciences (e.g., molecular biology, genetics, as- 
trophysics, and many others) produce extreme amounts of information which 
are often collected automatically. This is why it is impossible to analyze 
and exploit all these data manually; what we need are intelligent computer 
systems that can extract useful information (such as general rules or inter- 
esting patterns) from large amounts of observations. In short, “data mining 
is the non-trivial process of identifying valid, novel, potentially useful, and 
ultimately understandable patterns in data” [FPSS96]. 

One of the ways in which genetic algorithms and, more precisely, genetic
programming can be used in data mining is data-based modeling. A given system
is to be analyzed and its behavior described by a mathematical model; the
process is therefore (especially in the context of modeling dynamic physical
systems) called system identification [Lju99].

The principles have already been summarized in the GP introduction chap- 
ter, especially in Section 2.4.3 on symbolic regression, and they shall be re- 
peated and extended in the following: 

The main goal of regression is to determine the relationship of a dependent 
(target) variable t to a set of specified independent (input) variables x. Thus, 
what we want to get is a function f that uses x and a set of coefficients w 
such that 


$$t = f(x, w) + \epsilon \qquad (9.1)$$


where $\epsilon$ represents the error (noise) term.

Applying this procedure we assume that a model can be created with which
it will also be possible to predict correct outputs for other data examples
(test samples); from the training data we want to generalize to situations not
known (or not allowed to be analyzed) during the training phase.

When it comes to evaluating a model (i.e., a solution candidate in a GP-based
modeling algorithm), the formula has to be evaluated on a certain set of
evaluation (training) data X yielding the estimated values E. These estimated
target values are compared to the original values T, i.e., those which are
known from data retrieval (experiments) or calculated applying the original
formula to X.

This comparison is done by calculating the error between original and cal- 
culated target values. There are several ways how to measure this error, one of 
the simplest and probably most frequently used one being the mean squared 
error (mse) function; the mean squared error of the vectors A and B each 
containing n values is calculated as 

$$mse(A, B) = \frac{1}{n} \sum_{k=1}^{n} (A_k - B_k)^2 \qquad (9.2)$$
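Expressed in code, this error measure amounts to only a few lines. The following Python function is a direct transcription of the definition; the function name and the error handling for empty or differently sized inputs are choices made for this example.

```python
def mse(a, b):
    """Mean squared error of two equally long sequences of numbers."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both sequences must have the same non-zero length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


print(mse([1, 2, 3], [1, 2, 5]))  # (0 + 0 + 4) / 3
```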


Some of the major problems of data-based modeling are noise and overfit- 
ting: 


• In common language, noise is on the one hand simply that which is
heard, but on the other hand also unwanted sound which is added to the
audio signals that are of interest. Furthermore, the concept of noise is
also known in image and video processing, where it is used to describe
unwanted, disturbing signals. In the context of data-based modeling we
often see that additional and unwanted values are added to the original
signals; this disturbing additional data is called noise.


• In machine learning, overfitting is understood as the excessive fitting
of models to given data. As already mentioned, data-based training of
models is done using training data, i.e., sets of training examples of the
functions which are searched for; the problem is that it can happen
(especially in cases where too complex models are trained or the training
process is executed too long) that the learner adjusts to very specific
features or samples of the training data. Even a structurally inadequate
model may fit given training data perfectly if the model is complex
enough.

From the point of view of mathematical systems theory, we assume that
a system $\Sigma$ can be described by a function $\phi(\theta) : u \to y$,
where $u$ and $y$ are the system’s input and output, respectively, $\phi$
describes the structure of the function and $\theta$ denotes the vector of
parameters. Data-based structure identification is supposed to find a
function $\hat{\phi}(\hat{\theta}) : u \to y$ that reproduces the system’s
output. The more parameters are stored in $\hat{\theta}$ the easier it
becomes to reproduce the given training data, but it also becomes more
probable that $\hat{\phi}(\hat{\theta})$ represents not the basic behavior of
$\Sigma$ but rather the measured signal (which also includes noise). Of
course, as we do not know the size of $\theta$ (or the structure of $\phi$)
in general, we cannot know when $\hat{\theta}$ becomes “too big.”

Overfitting can also be seen as a violation of Occam’s razor (see Section 
2.6 for explanations on this); fitting too exactly to (noisy) training data 
might lead to a model whose ability to generalize is far worse than the 
general applicability of a simpler model. 

Unfortunately there is no general rule for avoiding overfitting as we
often do not exactly know the complexity of the system whose behavior
is to be modeled. However, there are several techniques that can help
to detect or avoid it: For instance, overfitting might cause a significant
rise of the variances of the estimated parameter values $\hat{\theta}_i$,
i.e., the parameter values estimated in independent identification runs
diverge (which should of course not be the case if the structure of
$\hat{\phi}$ and the size of $\hat{\theta}$ are correct); early stopping and
the use of validation sets which are not included in the training data can
also help to decrease the probability of overfitting.


Thus, accuracy (on training data) is not the only requirement for the
result of the modeling process: Compact and (if possible) minimal models are
preferred as they can be used more easily in other applications. It is, of
course, not easy to find models that ignore unimportant details and capture
the behavior of the system that is analyzed; due to this challenging character
of the task of system identification, modeling has been considered as “an art”
[Mor91].

In the following section we are going to explain the problems of noise and 
overfitting using a simple example. 


9.1.2 An Example 
9.1.2.1 Learning Polynomial Models 


Let us consider the following example: Let S be a system whose behavior 
is to be modeled using the input / output (target) training examples given in 
Table 9.1 (where X and Y values denote input and output data, respectively). 


By looking at these values as they are shown in Figure 9.1 the suspicion is
aroused that there might be a cubic connection between the X and Y values,
distorted by additive noise. This is in fact correct: The data were generated
using the model $y = x^3 - 100x + 100$ and adding noise (uniformly distributed
in the interval [-250; +250]). This is why the original function
$x^3 - 100x + 100$ is also depicted in Figure 9.1.

If we want to evaluate the original formula that was used for simulating
the system ($x^3 - 100x + 100$), we can for example evaluate this model on all
integral values for X in the range of the given training data (i.e., -15, -14,
..., 4, 5) and calculate the mean squared differences of these calculated
values and


Table 9.1: Data-based modeling example: Training data.


X   | Y          | X  | Y
-15 | -1571.1605 | -4 | 229.6581


[Plot: Original Data; mean squared error: 18556.4719]


FIGURE 9.1: Data-based modeling example: Training data. 


the given training target data for Y which yields 18,556.4719 — the “fitness” 
of the original formula therefore is approximately 18,556. 

Now let us suppose that we do not know or suspect anything about the
system or its order. We could therefore try for example polynomial approaches
of order 1, 3, 10, and 20; thus, we assume model structures of the form


$$y = a_0 + a_1 x + a_2 x^2 + \ldots + a_n x^n \qquad (9.3)$$


for a model of order n. The parameters $[a_0, a_1, a_2, \ldots, a_n]$ are now
to be set so that the model fits the given training data as exactly as
possible.

As we see in Figures 9.2, 9.3, 9.4, and 9.5, the linear model performs
fairly, the model of order 3 performs better on the given training data, and
the models of order 10 and especially 20 perform even a lot better; the
polynomial of order 20 is even able to explain the training data perfectly.

The quality of the models of order 1, 3, 10, and 20 generated in this way is
approximately 244,218, 14,435, 6,605, and 0¹, respectively.


[Plot: Order 1, MSE (training): 244217.8279]


FIGURE 9.2: Data-based modeling example: Evaluation of an optimally fit 
linear model. 


9.1.2.2 Testing Polynomial Models 


Now let us assume that test data are available for evaluating the models; 
these test data are not included in the training data but rather used for esti- 
mating the quality of the models produced (and of the identification method 
itself). These test data are given in Table 9.2. 


Now we see that the linear model performs even worse on the test data
($mse_{test} \approx 25 \times 10^6$, see Figure 9.6); the cubic model, which
performed a lot better in training, is also much more accurate on test data
($mse_{test} \approx 4.5 \times 10^6$, see Figure 9.7).


¹ Minor inaccuracies here are due to numerical imprecision.


[Plot: Order 3, MSE (training): 14435.8497]


FIGURE 9.3: Data-based modeling example: Evaluation of an optimally fit 
cubic model. 


[Plot: Order 10, MSE (training): 6605.174]


FIGURE 9.4: Data-based modeling example: Evaluation of an optimally fit 
polynomial model (n = 10). 


So, does this trend go on and does better fit on training data guarantee better
fit on test data? Analyzing the test performance of the models of order 10
and 20 the answer to this question obviously is: No. In Figure 9.8 we see that
the polynomial model of order 10 predicts values far outside the range of the
given test data, yielding a mean squared error value of $5 \times 10^{16}$.
The model of order 20 is not shown; its mean squared error on test data is
$5.8 \times 10^{34}$.


[Plot: Order 20, MSE (training): 3.4512e-005]


FIGURE 9.5: Data-based modeling example: Evaluation of an optimally fit 
polynomial model (n = 20). 


[Plot: Order 1, MSE (test): 25343377.6071]


FIGURE 9.6: Data-based modeling example: Evaluation of an optimally fit 
linear model (evaluated on training and test data). 


Summarizing this example we give an overview of training and test errors
for the data and models mentioned above in Figure 9.9 (models of order 0 and
5 were created in the same way as the other models). This behavior is typical:
As the number of parameters increases, the training errors decrease; in the


Table 9.2: Data-based modeling example: Test data.


877.4408 | 22 |
1149.4064 | 23 |


[Plot: Order 3, MSE (test): 4516768.3077]


FIGURE 9.7: Data-based modeling example: Evaluation of an optimally fit 
cubic model (evaluated on training and test data). 


beginning, test errors also tend to decrease², but after some time (as soon
as overfitting happens), test errors start to increase with increasing training 
effort. 

Please note that the training and test errors shown in Figure 9.9 are depicted 
on a logarithmic y-axis. 


² In the summary chart displayed in Figure 9.9 we have intentionally omitted
the training and test errors for n = 2. The reason is that it would have shown
that in this particular case the test error for the quadratic model is a lot
worse than for the linear as well as the cubic model; this would be correct,
of course, but in this way it is easier to sketch the characteristic behavior
of first decreasing and then increasing test errors as the number of
parameters increases.


[Plot: Order 10, MSE (test): 49584496262619024]


FIGURE 9.8: Data-based modeling example: Evaluation of an optimally fit 
polynomial model (n = 10) (evaluated on training and test data). 


FIGURE 9.9: Data-based modeling example: Summary of training and test 
errors for varying numbers of parameters n. 


9.1.2.3 Implementation 


All data generation and modeling steps used here have been implemented 
in MATLAB®, Version 7.0 (R14); the source code representing the implementation 
of this example can be found on the website of this book.³ 

For the fitting of a polynomial of order n we first compose a matrix Z as 
a concatenation of the input values (namely the x values of the training data, 
i.e., all values in [-15; 5]) raised to the powers 0, 1, ..., n: 


Z = [X^0 \; X^1 \; \ldots \; X^n], \quad X^n = [x_1^n, \ldots, x_N^n]^T \qquad (9.4) 


where X^n is a column vector consisting of all N input values to the power of 
n. 

Secondly, the training target values Y are, after transposing the matrices, 
divided by Z using the right matrix division function (/); this numerically 
solves the system of linear equations defined by the order of the model n, 
the input data Z, and the target values Y. Thus, we get the coefficients 
a_0, a_1, ..., a_n as the result of this division (as a vector p) and calculate 
the estimated target values Ŷ (denoted in the source code as Yhat) by multiplying 
this coefficient vector (poly) and Z; this represents the evaluation of the 
identified polynomial for each given sample. 

The training and test qualities are calculated using the mean squared errors 
function, i.e., we calculate the sum of squared residuals and divide by the 
number of samples considered. 
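
As an illustration, the fitting and error calculation described above can be sketched in Python. This is not the MATLAB source code of this example; the function names are hypothetical, and the least-squares problem is solved via the normal equations with a small Gaussian elimination, which is an assumption made to keep the sketch self-contained and is numerically less robust than a library solver for high polynomial orders.

```python
def design_matrix(xs, order):
    """Build Z: one row per sample, column k holds x**k for k = 0..order."""
    return [[x ** k for k in range(order + 1)] for x in xs]


def solve_linear_system(a, b):
    """Solve a * p = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    p = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * p[c] for c in range(r + 1, n))
        p[r] = (aug[r][n] - tail) / aug[r][r]
    return p


def fit_polynomial(xs, ys, order):
    """Least-squares coefficients a_0..a_n from (Z^T Z) p = Z^T Y."""
    z = design_matrix(xs, order)
    m = order + 1
    ztz = [[sum(row[i] * row[j] for row in z) for j in range(m)] for i in range(m)]
    zty = [sum(row[i] * y for row, y in zip(z, ys)) for i in range(m)]
    return solve_linear_system(ztz, zty)


def evaluate_polynomial(coeffs, xs):
    """Estimated target values for the given inputs."""
    return [sum(a * x ** k for k, a in enumerate(coeffs)) for x in xs]


def mean_squared_error(ys, yhat):
    """Sum of squared residuals divided by the number of samples."""
    return sum((y - e) ** 2 for y, e in zip(ys, yhat)) / len(ys)


# Noise-free toy data generated by y = 1 + 2x + 3x^2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [9.0, 2.0, 1.0, 6.0, 17.0]
coeffs = fit_polynomial(xs, ys, 2)
print(coeffs, mean_squared_error(ys, evaluate_polynomial(coeffs, xs)))
```

Evaluating the same coefficients on separate test inputs and comparing the two error values gives the training and test qualities discussed above.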

The data documented in Section 9.1.2 were generated using a noise range 
of 500. 


9.1.3 The Basic Steps in System Identification 


The following two phases in data-based modeling are often distinguished: 
Structural identification and parameter optimization. 


• First, structural identification is hereby seen as the determination of 
the structure of the model for the system which is to be analyzed; 
physical knowledge, for example, can influence the decision regarding the 
mathematical structure of the formula. This of course includes the 
determination of the functions used, the order of the formula (in the case 
of polynomial approaches, e.g.), and, in the case of dynamical models, 
potential time lags for the input variables used. 
In the simple example given previously this step was the decision to 
use a polynomial modeling approach; for example, the decision to try a 
polynomial model y = a_0 + a_1 x + a_2 x^2 + ... + a_n x^n of specific orders 
was the structural identification part. As we tried several polynomials 
of different orders we simply executed the procedure several times; this 
is exactly what is indicated by the feedback loop in Figure 9.10 (a). 


• Parameter identification is then the second step: Based on training data, 
the parameters of the formula are determined (optimized) meaning that 


³ http://gagp2009.heuristiclab.com. 
FIGURE 9.10: The basic steps of system identification ([BES01], [WER06]). 


the coefficients and, if used, time lags are fixed. 
Basically, this is what we did in the previous example by calculating the 
coefficients for the polynomials of different orders separately. 


This separation is schematically shown in the left part (a) of Figure 9.10 
(adapted from [BES01]). 

Of course, the whole process of building models out of data includes more 
steps than those mentioned above. Especially data preprocessing is a very 
important issue, i.e., preparing data before it is used for the “real” modeling 
process. Data downsampling, filtering, and the removal of data without in- 
formation can be applied in order to retrieve preprocessed data on which it is 
easier to efficiently generate appropriate models. 

Variables selection is also often considered a key issue in data-based mod- 
eling: Those variables are selected from the pool of variables available which 
shall be used for the essential modeling process. For example, variables which 
do not include information (since they are constant in the whole data set, e.g.) 
or are redundant to other ones can be omitted for simplifying the modeling 
process. Variables selection can thereby be done using expert knowledge or 
statistical methods. Exhaustive statistical methods are available as well as 
sequential iterative forward or backward variable selection: 


• Exhaustive search is executed by computing all possible combinations of 
variables and evaluating them; exactly that combination of channels will 
be selected which provides the best approximation of the measurement data. 
This method is able to provide an optimal solution (if the process is 
linear), but especially for higher dimensional problems (including large 
numbers of channels) it requires excessive computation time. In order 
to overcome this drawback, forward and backward selection can be used 
as alternatives even if they provide only suboptimal solutions. 


• In sequential forward selection the algorithm sequentially derives the list 
of input channels. In the first step, only one input channel is considered, 
namely the channel that minimizes the sum of squared errors. 
In the next step, another input channel is selected, where once again that 
channel is chosen which minimizes the sum of squared errors; the 
algorithm iteratively adds more and more input channels until a predefined 
accuracy is reached, at which point it terminates. Of course the 
results depend on the chosen basis functions. 


• The main difference when applying backward selection is that the 
algorithm starts with all available variables in the set of selected variables 
and then iteratively removes variables that do not have a statistically 
measurable connection with the observed (measured) target values. 


• Hybrid variants combining backward selection and a subsequent forward 
selection step have also been investigated for producing good results very 
efficiently. 
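
Sequential forward selection as described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: error_of is a hypothetical callback that returns the sum of squared errors of a model built on the given channels, and toy_error with its gains table is invented purely for demonstration.

```python
def forward_selection(candidates, error_of, target_error):
    """Greedily add the channel that minimizes the error until the
    predefined accuracy (target_error) is reached or no channel is left."""
    selected = []
    remaining = list(candidates)
    best_error = float("inf")
    while remaining and best_error > target_error:
        # Evaluate every remaining channel together with those chosen so far.
        channel = min(remaining, key=lambda c: error_of(selected + [c]))
        best_error = error_of(selected + [channel])
        selected.append(channel)
        remaining.remove(channel)
    return selected, best_error


# Hypothetical error function: each channel removes a fixed share of the error.
gains = {"u": 6.0, "v": 3.0, "w": 0.5}


def toy_error(channels):
    return 10.0 - sum(gains[c] for c in channels)


print(forward_selection(["u", "v", "w"], toy_error, 2.0))
```

In a real setting error_of would fit a model on the selected channels for every call, which is why the number of evaluations grows with the number of channels, although far more slowly than in exhaustive search.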


These basic steps of the data driven modeling process are shown in the right 
part (b) of Figure 9.10. 

As we see in both diagrams shown in Figure 9.10, the total system identifi- 
cation process based on measurement data is not finished as soon as models 
are created. A decision whether the model at hand is appropriate and fulfills 
the given quality requirements has to be made during a subsequent validation 
step. If this validation (often also called test phase⁴) fails, the process might 
be repeated starting again at the structural identification or data preprocess- 
ing step. 

The major drawback of this classical approach is obvious: As the structure 
of the model has to be fixed before identifying parameters, a priori knowledge 
has to be used. However, there is a large number of applications in which 
the a priori model information is not available to the desired precision. For all 
these cases, several generic so-called “model free” approaches are widely used, 
ranging from simple static maps up to self-organizing neural networks; see for 
instance [dRLF+05] for ANN-based identification of a Diesel engine’s NO_x 
emissions, [PP01] for a specific spectral analysis tool to describe the behavior 
of a plant or [THL94] for a neural network approach to optimal filtering of 
sensor data. 


⁴ Please note that in some cases the terms validation and test phase are used synonymously, 
but often (and also in the following test case documentations) the validation and test 
phase are separate model analysis phases. Detailed explanation is to come in the following 
sections. 
In spite of the evident simplicity of generic approaches, the drawbacks are 
known as well: Over-parameterization, lack of extrapolation and often even 
of interpolation capabilities [dRLF+05], large data requirements, etc. 


9.1.4 Data-Based Modeling Using Genetic Programming 


Using Genetic Programming for data-based modeling has the advantage 
that we are able to design an identification process that automatically in- 
corporates variables selection, structural identification, and parameters opti- 
mization in one process. 

When applying genetic programming to data-based modeling, the function f 
which is searched for is not of any pre-specified form; during the GP process, 
low-level functions are combined to more complex formulas. 
Given a set of functions f_1, ..., f_n, the overall function induced by genetic 
programming can take a variety of forms. Usually, standard arithmetical 
functions such as addition, subtraction, multiplication, and division are in the 
set of functions, but trigonometric, logical, and more complex functions 
can also be included. 


Thus, the key feature of this technique is that the object of search is a 
symbolic description of a model, not just a set of coefficients in a pre-specified 
model. This is in sharp contrast with other methods of regression, including 
linear regression, polynomial approaches, or also artificial neural networks, 
where a specific structure is assumed and often only the complexity of this 
model can be varied. 

Of course, data preprocessing and a separate validation / test phase are 
also parts of the GP-based modeling process; the main workflow is sketched 
in Figure 9.11. 

In the following we are going to give an overview of our system identifi- 
cation implementation in HeuristicLab in Section 9.2 and discuss concepts 
developed for analyzing the similarity of mathematical models produced by 
GP in Section 9.4. 


Two typical application scenarios for GP-based modeling are then analyzed 
using real-world test data as well as benchmark data: 


• In Section 11.1 we present basics and examples for time series analysis 
and the design of so-called virtual sensors; 


• in Section 11.2 we demonstrate classification as a possible application 
for GP-based structure identification. 


In both cases we discuss the effects of using enhanced concepts for GAs that 
have been discussed in the previous chapters as well as advanced GP concepts 
that are to be described in the following sections. 
FIGURE 9.11: The basic steps of GP-based system identification. 


9.2 GP-Based System Identification in HeuristicLab 
9.2.1 Introduction 


HeuristicLab (HL) is a framework for developing and testing optimization 
methods and their parameters and for applying them to a multitude of problems. 
The project was started in 2002 and has evolved into a stable and productive 
optimization platform; it is continuously enhanced and is the topic of several 
publications ([WA04c], [WA04a], [WA04b], [WA05a], and [WWB+07]). On the 
HeuristicLab website⁵ the interested reader can find installable software, 
information, documentation, and publications in the context of HeuristicLab 
and the research group HEAL⁶. 


⁵ http://www.heuristiclab.com. 

This extensible and flexible framework enables us to combine the advanced 
GA concepts with genetic operators for GP; operators for analyzing dynam- 
ics in GP populations can be integrated as well as evaluators that compare 
training, validation, and test qualities. 

Here we want to summarize how system identification problems are repre- 
sented in HeuristicLab, how we have designed an appropriate solution encod- 
ing and respective operators, and finally show how we have defined a similarity 
measure for these solution candidates. 


9.2.2 Problem Representation 


A system identification problem instance has to include all data which are 
needed by genetic programming for generating models describing the under- 
lying system’s behavior. 

The most important part of the representation of a system identification 
problem that is to be tackled with genetic programming is the data collection 
storing all available measurement data; the index of the target variable also 
has to be known and available for the modeling algorithm. 

Furthermore, there also has to be an indication which data samples are 
to be used as training, validation, and test data (in our case given as start 
and end sample indices). The use of these data segments is different for each 
particular partition: 


• Training data are the real basis for the algorithm; the modeling 
algorithm is able to use these training examples of the input / output 
behavior of the system at hand (or rather of the model that is to be 
learnt) for determining the quality of solution candidates (which in our 
case here are models / formulas). 


• Validation data are available for the training algorithm, but normally 
not used for the real evolutionary optimization process. These data can 
for example be used for detecting overfitting phenomena, for pruning, 
or other additional model manipulation operations. 


• Test data, finally, must not be considered by any part of the training 
algorithm. Still, these data shall be used for testing the created 
models on new data, i.e., data not included in the algorithm’s data base, 
so that we can determine whether the algorithm was able to generate 
appropriate models or not. 
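
A partitioning given by start and end sample indices could be handled as in the following Python sketch. The Partition type, the exclusive end index, and the data layout data[variable][sample] are assumptions made for illustration and do not reproduce HeuristicLab's actual classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    start: int  # first sample index of the partition (inclusive)
    end: int    # end sample index (exclusive; an assumption of this sketch)


def samples(data, part):
    """Return the samples of every variable that fall into the partition."""
    return [column[part.start:part.end] for column in data]


def overlaps(a, b):
    """True if two partitions share at least one sample index."""
    return a.start < b.end and b.start < a.end


# Two variables with ten samples each, stored as data[variable][sample].
data = [list(range(10)), list(range(10, 20))]
training = Partition(0, 6)
validation = Partition(6, 8)
test = Partition(8, 10)

# The test data must stay separate from everything the algorithm may see.
assert not overlaps(training, test) and not overlaps(validation, test)
print(samples(data, test))
```

Checking for overlaps before a run is a cheap way to make sure that test samples cannot leak into training or validation.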


Additionally, there also has to be a possibility to state which variables of 
the data base are really available for the modeling algorithm. For example, 
this becomes relevant when sensor data are included in the data base and used 
for statistical analysis (correlation analysis, automated fault detection, etc.), 
but the models that are to be generated for a certain target variable are still 
not supposed to contain these variables. 


⁶ Heuristic and Evolutionary Algorithms Laboratory, Linz / Hagenberg, Austria. 

Pure availability of a variable is still not sufficient information; what we 
also need is whether and which time offsets are allowed when referencing a 
variable. For example, let y be the target variable and u, v, and w possible 
input variables for a model for y; as we want to model y for time (sample) t we 
search for a model for y_t. The first crucial decision to be made is whether we 
want to generate static or dynamic models: In static models, only inputs at 
time t are considered for describing the target variable at time t; our target y 
would be described as a function f : y_t = f(u_t, v_t, w_t). In dynamic modeling, 
on the contrary, input variables can also be referenced with a certain time lag 
meaning that not only values of time t are used but also “historic” data. For 
example, f could then be a function modeling y_t using u_{t-4}, v_{t-1}, v_{t-2}, and 
w_t. 

In several application scenarios one also explicitly excludes input values 
of time t; what we get by excluding contemporary input data is a predic- 
tion model that can also be used for modeling future values on the basis of 
previously measured / recorded data. 

Furthermore, the generation of autoregressive models also becomes possible: 
Autoregressive models are formulas that model an output y_t incorporating 
previous outputs y_{t-1}, y_{t-2}, ..., y_{t-t_max}; an exemplary autoregressive model 
for our example could be f_AR : y_t = u_t + y_{t-2} + w_{t-1}. 

So, as the target variable can also be used with certain time offsets, GP is 
also able to generate autoregressive models. 
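
The exemplary autoregressive model f_AR can be evaluated on recorded series as in the following Python sketch. The function names and the choice to start at the first sample for which all referenced offsets exist are assumptions for illustration; the previous outputs are taken from the measured series, so each value is a one-step estimate.

```python
def f_ar(u, w, y, t):
    """Exemplary autoregressive model: y_t = u_t + y_(t-2) + w_(t-1)."""
    return u[t] + y[t - 2] + w[t - 1]


def estimate(u, w, y, max_offset=2):
    """Estimate y_t for every t at which all referenced lags are available."""
    return [f_ar(u, w, y, t) for t in range(max_offset, len(y))]


u = [1.0, 2.0, 3.0, 4.0]
w = [10.0, 20.0, 30.0, 40.0]
y = [0.5, 1.5, 2.5, 3.5]
print(estimate(u, w, y))  # estimates for t = 2 and t = 3
```

The first max_offset samples cannot be estimated because the model would have to reference values before the start of the recording.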

Lots of additional information for system identification problem instances 
can also be very useful in the modeling process: 


• Complexity limits for the models that are to be created can be given as 
maximum values for the height as well as the size of the models. Height 
hereby is equal to the height of the respective model structure tree as 
is to be described in Section 9.2.4; size refers to the number of nodes of 
the structure tree. 


• Meta-information such as descriptions of the data and the underlying 
system, or descriptions and names of the variables in the data base. 


• A collection of function and terminal definitions that can be used for 
compiling and evaluating models; a detailed description of the 
management of function bases follows in Section 9.2.3. 


• The best solutions found so far; this of course also has to include at 
least information about 


— the data partitions used as training and validation data, 



— the evaluation operator and respective parameter settings applied 
for evaluating solution candidates, 


— which variables were used in the modeling process applying which 
minimum and maximum offsets, and 


— the function and terminal definitions that were available for com- 
piling and evaluating models. 


Specific parameters for classification problems shall be described in Section 
11.2 on learning classifiers using GP. 


9.2.3 The Functions and Terminals Basis 
9.2.3.1 Motivation, Introduction 


The correct design of the functions and terminals basis used for compiling 
and evaluating formulas is one of the most crucial issues in the design of a 
GP-based system identification approach; for the sake of simplicity we will 
in the following refer to this pool of definitions of functions and terminals 
as functions basis. In fact, this is not wrong since terminal definitions are 
also functions that take several inputs such as a reference to the data basis, 
the variable and sample indices, a (time) offset, and a concrete coefficient for 
calculating the returned value. Still, as the handling of terminals differs a lot 
from the handling of functions, we will also treat them separately whenever 
necessary. 

Regarding the implementation, HeuristicLab and all plugins (at least until 
now) are implemented in C# using the .NET framework, so the most obvious 
approach would be to use the functions of the .NET framework for building 
models; essentially, this was done in our GP implementation for the versions 
1 and 1.1 of HL ([Win04], [WAW05a], [WAW05b], [WAW06a], [WAW06e], 
[WAW06c], [WEA+06]). 

During our own research activities and in the course of discussions with research 
partners in academia as well as industry we became more and more 
convinced that it would be a great benefit for GP-based modeling if the users 
were able to program and parameterize the functions and terminals themselves. 
So, starting from the implementation in HL 2.0, a flexible and 
user-programmable functions basis has been used. 

The definition of the evaluation of functions and terminals surely is the core 
of any functions and terminals management unit. So, for each function as well 
as for every terminal definition we have to be able to manage the source code 
that represents the definition of its evaluation, compile it, and provide the 
compiled functions to the GP process. 

In detail, these definitions are designed and implemented in HeuristicLab 
as is explained in the following sections. 
9.2.3.2 Definition of the Evaluation of Terminals 


The definition of the evaluation of a terminal is given by a function that 
requires a reference to the data basis, variable and sample indices, a sample 
offset, and a coefficient as inputs; depending on the selected terminal defi- 
nition, this information is processed and the return value calculated. So, a 
terminal definition t is a function of the data collection D, the variable index 
v, the sample (time) index s, a sample offset o, and a coefficient c. 

Let us consider the following examples t_var, t_diff, and t_const representing 
standard variable, differential, and constant definitions: 


t_var(D, v, s, o, c) = c * D[v, s - o] 
t_diff(D, v, s, o, c) = c * (D[v, s - o] - D[v, s - o - 1]) 
t_const(D, v, s, o, c) = c 


t_var calculates the product of the given coefficient and the value 
of variable v at sample index s shifted by o indices, thus taking the value 
D[v, s - o]. t_diff calculates the difference of the referenced value at D[v, s - o] 
and its predecessor, D[v, s - o - 1], and returns it multiplied with the coefficient 
c. t_const, finally, simply returns the given coefficient and thus represents a 
constant terminal definition. 

The definition of such a terminal can of course become arbitrarily simple 
or complex, depending on the user’s intention. Anyway, in HL the definition 
of the evaluation functions has to be done in C# notation using the following 
interface: 

public double TerminalEvaluation(double[][] Data, 

int Var, int Sample, int Offset, double Coeff); 

The implementation of a terminal definition thus is a method following the 
interface given above. The respective source codes for the exemplary terminals 
t_var, t_diff, and t_const could be defined in the following way: 


t_var   : return Coeff * Data[Var][Sample-Offset]; 
t_diff  : return Coeff * ( Data[Var][Sample-Offset] - 
                           Data[Var][Sample-Offset-1] ); 
t_const : return Coeff; 
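
For readers who want to experiment outside of HL, the three terminal definitions can be mirrored in Python as follows. This is an illustrative port, not HL code; the helper value_at and its range check are additions of this sketch, introduced because a negative index would silently wrap around in Python when sample - offset falls before the first sample.

```python
def value_at(data, var, index):
    """Read data[var][index], refusing indices before the first sample."""
    if index < 0:
        raise IndexError("sample offset reaches before the start of the data")
    return data[var][index]


def t_var(data, var, sample, offset, coeff):
    """Standard variable terminal: c * D[v, s - o]."""
    return coeff * value_at(data, var, sample - offset)


def t_diff(data, var, sample, offset, coeff):
    """Differential terminal: c * (D[v, s - o] - D[v, s - o - 1])."""
    current = value_at(data, var, sample - offset)
    previous = value_at(data, var, sample - offset - 1)
    return coeff * (current - previous)


def t_const(data, var, sample, offset, coeff):
    """Constant terminal: returns the coefficient, ignoring the data."""
    return coeff


data = [[1.0, 2.0, 4.0, 8.0]]          # one variable, four samples
print(t_var(data, 0, 3, 1, 0.5))       # 0.5 * data[0][2]
print(t_diff(data, 0, 3, 1, 2.0))      # 2.0 * (data[0][2] - data[0][1])
print(t_const(data, 0, 3, 1, 7.0))
```

All three functions share one signature, which is what allows a structure tree node to call any terminal definition in the same way.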


9.2.3.3 Definition of the Evaluation of Functions 


The interface for function evaluation definitions is a lot simpler than the 
evaluation interface for terminals as described above: A function is simply 
defined by the way it calculates a value given a set of input values. Addition- 
ally, we also use a variant index so that it is possible to define several variants 
of functions within one function definition. So, a function definition f is a 
function of the input data vector input and the variant v. 
Let us consider the following examples f_add, f_div, and f_trig representing 
addition, division, and trigonometric functions: 


f_add(input, v) = sum(input) 
f_div(input, v) = input[1] / input[2] 

f_trig(input, v) = \begin{cases} 
  \sin(input[1]) & v = 1 \\ 
  \cos(input[1]) & v = 2 \\ 
  \tan(input[1]) & v = 3 \\ 
  \text{error} & \text{otherwise} 
\end{cases} 


f_add calculates the sum of all input values, f_div divides the first argument by 
the second one, and f_trig returns the sine, the cosine, or the tangent of the 
first input, depending on the value of the variant index passed. 

In HL the definition of the evaluation functions has to be done using the 
following interface: 

public double FunctionEvaluation(double[] Args, int Var); 

The implementation of a function definition thus is a method following the 
interface given above; the respective source codes for the exemplary functions 
f_add, f_div, and f_trig could be defined in the following way: 


f_add  : double d = 0; 
         for (int i=0; i<Args.Length; i++) 
           d += Args[i]; 
         return d; 

f_div  : if (Args[1]==0) return double.NaN; 
         return (Args[0] / Args[1]); 

f_trig : if (Var==0) return Math.Sin(Args[0]); 
         if (Var==1) return Math.Cos(Args[0]); 
         if (Var==2) return Math.Tan(Args[0]); 
         throw new Exception("Unknown function variant"); 


Of course, logical functions can also be integrated into the functions pool in 
this way, as well as boolean functions connecting logical and boolean functions. 

Please note that the functions interface definition implemented in HL is a 
bit more sophisticated. In fact, what is also handed over to the function is an 
array storing a certain number of previously calculated values, i.e., a history 
of exactly this function: 

public double FunctionEvaluation(double[] Args, int Var, 

double[] History) ; 

If the history array is used in the evaluation function, then a pre-defined 
number of calculated values are automatically saved in an appropriate array 
and given to the function at its next evaluation. Thus, it is for example 
possible to implement an integral function f_int using the history hist: 


f_int(input, v, hist) = hist[1] + input[1] 


which in HL / C# notation could be implemented as 

f_int : return History[0] + Args[0]; 
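
How the history is maintained between evaluations can be illustrated with a small Python sketch. The driver evaluate_series, the initial history of zeros, and the convention that the most recent result is stored at index 0 are assumptions of this sketch; the text above only states that previously calculated values are stored and handed back at the next evaluation.

```python
def f_int(args, variant, history):
    """Integral function: previous result plus the current input."""
    return history[0] + args[0]


def evaluate_series(function, inputs, variant=0, history_length=1):
    """Evaluate a history-based function sample by sample.

    After each evaluation the result is pushed to the front of the
    history, and the oldest stored value is dropped.
    """
    history = [0.0] * history_length
    results = []
    for value in inputs:
        result = function([value], variant, history)
        history = [result] + history[:-1]
        results.append(result)
    return results


print(evaluate_series(f_int, [1.0, 2.0, 3.0]))  # running sum of the inputs
```

Because the state lives in the caller and not in the function definition, the same definition can be used by many nodes, each with its own history.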


9.2.3.4 String Representations of Terminals and Functions 


Even though the evaluation on a given data basis is the most important 
task for a model, appropriate string representations are also necessary for 
representing formulas in prefix notation, standard notation (as a mixture of 
infix and prefix notations), or in such a way that they can be immediately 
incorporated in MATLAB®, Mathematica®, LaTeX, or C/C++/C# program 
code. 

For each representation variant there are specific interfaces for terminals and 
functions; in all cases character strings are returned, but the input parameters 
vary significantly. The terminals’ string representations are given the same 
parameters as the evaluation functions (except for a reference to the data 
basis) and, in some cases, the variable name; string representation methods 
for functions use the string representations of the function’s inputs and return 
composed strings representing the function and its inputs. 

In the following we will pick the standard (infix/prefix) notation for 
demonstrating the mechanisms used. For terminals and functions we use the 
interfaces 

public string Terminal_Standard(int Var, string VarName, int Offset, 
    double Coeff); 

public string Function_Standard(string[] Args, int Var); 

respectively. For standard variables and the addition function, for example, 
the respective method implementations could be given in the following way: 


t_var : string s = "[" + Coeff.ToString() + "*"; 
        s = s + VarName; 
        if (Offset==0) s = s + "(t)"; 
        else s = s + "(t-" + Offset.ToString() + ")"; 
        return (s + "]"); 

f_add : string s = "(" + Args[0]; 
        for (int i=1; i<Args.Length; i++) 
          s = s + "+" + Args[i]; 
        s = s + ")"; 
        return s; 
The standard string representations of two terminals referencing variable w 
with time-offset 4 and coefficients 1.2 and 0.9, respectively, would thus result in 
[1.2*w(t-4)] and [0.9*w(t-4)]; the standard string representation of the 
addition of these two terminals would be ([1.2*w(t-4)]+[0.9*w(t-4)]). 


9.2.3.5 Parametrization of Terminals and Functions 


Apart from the definitions of evaluation and string representation of termi- 
nals and functions there are several parameter settings for them that are to 
be summarized in this section. 

Terminal definitions can be parameterized in the following ways: 


• The data type and the distribution function of the coefficients allowed 
have to be defined: Coefficients can be 


— either integral values or real-valued, and 


— their distribution can be either uniform (defined by minimum and 
maximum values) or Gaussian (defined by average μ and standard 
deviation σ). 


• The set of possible parent types can be defined, i.e., the user is able 
to declare which functions are allowed to use the respective terminal as 
input and which ones are not allowed to do so. 
This selection of possible parent functions can be done either explicitly 
by selecting a set of functions that are allowed as direct parents, or 
implicitly by defining which functions are not allowed as parents of the 
respective terminal type. 


The parametrization possibilities for function definitions are even more 
extensive than those for terminals: 


• An arbitrary number of variants can be defined. Apart from considering 
these variants in the method code (as can be seen in the code for the 
trigonometric function definition in Section 9.2.3.3), each variant can be 
activated and de-activated independently of the other variants. 


• Additionally, for each variant the function’s arity (its number of input 
parameters) has to be defined. The arity can be either fixed or given as 
a range defined by minimum and maximum values. 


• Each function has to define its neutral element(s), also called identity 
element(s). In binary operations working on elements of the set X, an 
element e of X is called left identity with respect to the operation ∘ if 
e ∘ a = a for all elements a in X; in analogy to this, e is called right 
identity with respect to ∘ if a ∘ e = a for all elements a in X. 

This concept of elements that leave other elements unchanged when 
combined with them is here used in a slightly generalized way as we 
define neutral (identity) elements for each possible input index of a 
function: 


— There can be one neutral element that is used for all input indices, 
or 


— neutral elements can be defined for each possible input index 
independently. 


For the addition or subtraction functions, e.g., the neutral element for 
all possible indices is 0, for the multiplication function it is 1 for all 
inputs. But when it comes to the division, then the identity elements 
have to be defined separately for each input: As we divide the first input 
by the second one, the neutral element for the first index is 0, whereas 
for the second input it is 1 (because 0/a = 0 and a/1 = a for all a ∈ ℝ). 


• Similar to the parent type restrictions that can be set for terminals, 
functions can also define a set of valid parent function definitions. Again, 
this can be done either directly or indirectly by selecting functions that 
are not allowed as parent function types. 


• Finally, functions can also define child type restrictions. This can also 
be done directly by selecting certain function or terminal definitions as 
valid child types (i.e., types that are allowed as inputs for the function), 
or by explicitly excluding certain types from the set of possible input 
definitions. 

In order to maximize the flexibility of this child type management 
concept, these selections can be done either for all input indices uniformly 
or for each input index separately. 


Function and terminal definitions and their respective parametrizations are 
collected in function and terminal management units which we here, as already 
mentioned before, call “functions bases”. In each functions basis we not only 
store function and terminal definitions, but also which ones are activated 
and which ones are not; in addition, an initial weighting factor is given for 
each definition, denoting its relative probability of being chosen whenever a 
function or terminal is selected at random. 
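
The role of the activation flags and weighting factors can be illustrated with a short Python sketch. The Definition type and the function choose_definition are hypothetical names for this illustration and do not reflect the actual HL classes; the selection uses the standard library's weighted random choice.

```python
import random
from dataclasses import dataclass


@dataclass
class Definition:
    name: str
    weight: float        # relative probability of being chosen
    active: bool = True  # de-activated definitions are never selected


def choose_definition(basis, rng):
    """Pick one active definition with probability proportional to its weight."""
    candidates = [d for d in basis if d.active and d.weight > 0]
    if not candidates:
        raise ValueError("the functions basis contains no selectable definition")
    weights = [d.weight for d in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


basis = [
    Definition("add", 2.0),
    Definition("div", 1.0),
    Definition("tan", 5.0, active=False),
]
rng = random.Random(1)
print([choose_definition(basis, rng).name for _ in range(5)])
```

With these example weights, "add" is expected to be drawn about twice as often as "div" in the long run, while "tan" is never drawn because it is de-activated.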


9.2.4 Solution Representation 
9.2.4.1 Representing Formulas by Structure Trees 


As we have now described how function and terminal definitions are 
managed, we shall take a look at the representation of solution candidates for 
GP-based system identification. The most intuitive way to represent models 
is modeling them as structure trees; starting with Koza’s first GP attempts 
using LISP structure trees, the concept of trees representing formulas has had a 
long tradition in GP (see [Koz92b], [KKS+03b], [LP02], [Kei02], or [PMR04], 
e.g.). 

Structure trees consist of nodes and references from parent nodes to their 
children. Thus, for representing formulas we have to create node structures 
that are able to store all parameters needed as well as references to the function 
and terminal definitions used; this concept is visualized in Figure 9.12. 


[Figure labels: function, params, functional basis, terminal] 


FIGURE 9.12: Structure tree representation of a formula. 


The following parameters have to be stored by structure nodes in addition 
to references to their function or terminal definition: 


• Each terminal node has to store the index of the variable it references,
the sample (time) offset, and the value of the coefficient that is to be
used as a multiplicative factor.

Thus, when it comes to evaluating a terminal node for a given data basis
and a certain sample index, the referenced terminal definition is called
using the given data and sample index as well as the parameters stored
in the node; the value returned by the terminal definition function is
returned as the result of the node’s evaluation or representation method.

• A function node has to store not only references to its child nodes and
a function definition, but also the index of the function’s variant. So,
when it comes to evaluating a function node or compiling its string
representation for a given data basis and a specific sample index, the
child nodes are first evaluated with the given data and then the
referenced function is called with the children’s returned values and the
variant index stored. The result of this function call is then returned as
the result of the node’s evaluation.

As we have described in Section 9.2.3.3, some functions also consider
previously calculated values. So, function nodes additionally have to
manage history arrays in which the calculated values are stored and
which are also given to the function definition at the next sample’s
evaluation.
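The evaluation scheme of terminal and function nodes can be illustrated by a
small Python sketch. It is not the HeuristicLab implementation; the class names
TerminalNode and FunctionNode, the example function definition arithmetic, and
the data layout (one list of samples per variable) are assumptions made only
for this illustration, and the terminal definition is reduced to a simple
multiplication.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class TerminalNode:
    variable_index: int   # which variable is referenced
    sample_offset: int    # time offset, e.g. -1 for the previous sample
    coefficient: float    # multiplicative factor

    def evaluate(self, data: Sequence[Sequence[float]], sample_index: int) -> float:
        # Assumption: data[variable][sample]; the caller has to make sure
        # that sample_index + sample_offset is a valid index.
        value = data[self.variable_index][sample_index + self.sample_offset]
        return self.coefficient * value


@dataclass
class FunctionNode:
    # definition(child_values, variant, history) -> value
    definition: Callable[[List[float], int, List[float]], float]
    variant: int
    children: list
    history: List[float] = field(default_factory=list)

    def evaluate(self, data: Sequence[Sequence[float]], sample_index: int) -> float:
        # Children are evaluated first, then the referenced function is called
        # with their values, the stored variant index and the history array.
        child_values = [c.evaluate(data, sample_index) for c in self.children]
        result = self.definition(child_values, self.variant, self.history)
        self.history.append(result)  # available at the next sample's evaluation
        return result


def arithmetic(values: List[float], variant: int, history: List[float]) -> float:
    """Hypothetical function definition: variant 0 adds, variant 1 subtracts."""
    if variant == 0:
        return sum(values)
    return values[0] - sum(values[1:])


# Usage: the formula 2.0 * x0(t) + 0.5 * x1(t-1), evaluated at sample index 1.
data = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
tree = FunctionNode(arithmetic, 0,
                    [TerminalNode(0, 0, 2.0), TerminalNode(1, -1, 0.5)])
value = tree.evaluate(data, 1)
```

In this sketch the history array is simply appended to after each evaluation;
how many previous values a function actually needs is left open here.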


9.2.4.2 Initialization, Crossover, and Mutation 


The initialization of structure trees is essentially the compilation of
random tree structures referencing randomly chosen function and terminal
definitions. Of course, all constraints given by the functions basis have to be
considered:


• The number of children of each function node has to fulfill the arity
constraints given by the function definition parametrization; in the case
of fixed arity the number of children has to be exactly this value, and
in the case of variable arities the number of children may not fall below
the minimum or rise above the maximum arity limit.


• Of course, parent and child constraints also have to be considered.


• The structure complexity given in the problem representation (regarding
height and size of structure trees) may not be exceeded.


• Variable indices are chosen according to variable availabilities; sample
offsets are initialized according to minimum and maximum sample
offsets defined by the problem instance.


• Coefficients of terminal nodes are initialized according to parameter
settings defined by the terminal definition.


The most frequently used crossover operator is the single-point subtree 
exchanging crossover already described in detail in Section 2.2.1.3. Subtrees 
are exchanged and new formulas are formed; the references to the function 
and terminal definitions are copied into the new solution candidate. Figure 
9.13 illustrates this mechanism. 

Of course, all constraints defined by the functions basis have to be satisfied 
here, too. Especially child and parent relations of the new combinations have 
to be checked and invalid constellations avoided. The complexity limitation 
requirements given by the problem instance also have to be fulfilled. 

In fact, we have implemented and use three different types of crossover
operators:


• The standard crossover variant chooses subtrees without considering
their size.


• The low level crossover variant tries to exchange rather small subtrees,
e.g., of height 1 or 2.


• The high level crossover variant tries to exchange rather big subtrees,
for example the roots’ children.


FIGURE 9.13: Structure tree crossover and the functions basis.


Finally, mutating a structure tree can be done in several different ways. 
Some structural as well as parametric mutation variants are as follows: 


• A subtree could be deleted or replaced by a randomly re-initialized
subtree.


• A function node could for example change its function type or turn into
a terminal node.


• A terminal node representing a variable could for example change its
index and thus in the following refer to another variable.


• A terminal node representing a constant could be multiplied with a
factor. A good choice for the distribution of these multiplicative mutation
factors could be a Gaussian distribution with average 1.0 so that the
probability of smaller changes is greater than the probability of larger
modifications.
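The last mutation variant can be sketched as follows. The function name
mutate_constant and the default standard deviation of 0.1 are assumptions
chosen for this example; the text above only fixes the mean of the factor
distribution at 1.0.

```python
import random


def mutate_constant(value, sigma=0.1, rng=random):
    """Multiply a constant by a factor drawn from a Gaussian with mean 1.0.

    sigma is the standard deviation of the factor (an assumed parameter);
    factors close to 1.0, i.e. small changes, are the most likely ones.
    """
    factor = rng.gauss(1.0, sigma)
    return value * factor


# Usage with a seeded generator so that runs are reproducible.
rng = random.Random(42)
mutated = [mutate_constant(5.0, 0.1, rng) for _ in range(1000)]
```

With a mean factor of 1.0, the mutated constants scatter around the original
value, so repeated mutation does not systematically shrink or inflate it.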


Up to now we have always stressed the fact that complexity limitations are
given in the problem representation of the concrete system identification
problem at hand. In fact, complexity limitations can also be defined by crossover
and mutation operators; these operators can be parameterized so that they
produce models by crossing parents or mutating formulas that fulfill size or
height restrictions independently of the settings given in the problem. These
limitations could for example also be modified during the execution of the GP
process.


9.2.5 Solution Evaluation 
9.2.5.1 Standard Solution Evaluation Operators 


The primary task of an evaluation operator estimating the fitness of a sys- 
tem identification solution candidate is surely to measure how well the values 
calculated using the model fit the original target values. Numerous different 
evaluation functions are possible and have been reported on in the literature; 
in principle, the estimated values e (calculated by evaluating the model on the 
given data basis) are compared to the original target values o. In this context 
it is irrelevant for the function whether the model is evaluated on training, 
validation, test, or any other data partition. 

Here we describe three rather simple functions that have also been imple- 
mented as evaluation operators for HeuristicLab: 


• The mean squared errors function (MSE) has already been described
in Sections 2.4.3 and 9.1: The function returns the average value of the
squared residuals of e and o:

$$ MSE(e, o) = \frac{1}{N} \sum_{i=1}^{N} (e_i - o_i)^2; \quad N = |e| = |o| \tag{9.5} $$


• The coefficient of determination (R²) function can be used for
measuring the proportion of a variable’s variability that is accounted for by
the model that tries to explain it; it can also be seen as the ratio of the
variability of the modeled target values to the variability of the
original target values. R² of original and modeled target values, o and e,
respectively, is defined as

$$ R^2(e, o) = 1 - \frac{SS_E}{SS_T} \tag{9.6} $$

$$ SS_E = \sum_{i=1}^{N} (o_i - e_i)^2, \quad SS_T = \sum_{i=1}^{N} (o_i - \bar{o})^2 \tag{9.7} $$

$$ \bar{o} = \frac{1}{N} \sum_{i=1}^{N} o_i; \quad N = |o| \tag{9.8} $$

where $SS_E$ stands for the sum of squared errors (i.e., of the residuals)
and $SS_T$ for the total sum of squares of the original values. The better
a model is, the more the R² value converges to 1.

• The variance accounted for (VAF) function is defined via the fraction of
the variances of the residuals and the original target values:

$$ VAF(e, o) = 1 - \frac{var(o - e)}{var(o)} \tag{9.9} $$

$$ var(x) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2, \quad \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i; \quad N = |x| \tag{9.10} $$

The variance of the residuals, i.e., the differences between the original
and modeled values, is thus divided by the original values’ variance; the
smaller the residuals’ variance is, the nearer the calculated value
converges towards 1.

The main difference of this evaluation function compared to other ones,
for example MSE or R², is that it does not punish constant residuals;
only the variance of the residuals is taken into account and might
decrease a model’s quality.
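The three evaluation functions can be written down in a few lines of plain
Python. This is an illustrative sketch, not the HeuristicLab operators; the
function names are our own, and VAF is implemented as 1 minus the ratio of the
residuals’ variance to the original values’ variance, as described above. The
example at the end uses estimated values that are shifted by a constant offset
in order to show that VAF, unlike MSE and R², does not punish constant
residuals.

```python
def mse(e, o):
    """Mean of the squared residuals of estimated (e) and original (o) values."""
    return sum((ei - oi) ** 2 for ei, oi in zip(e, o)) / len(o)


def variance(x):
    """Population variance (division by N)."""
    mean = sum(x) / len(x)
    return sum((xi - mean) ** 2 for xi in x) / len(x)


def r2(e, o):
    """Coefficient of determination: 1 - SS_E / SS_T."""
    mean_o = sum(o) / len(o)
    ss_e = sum((oi - ei) ** 2 for ei, oi in zip(e, o))
    ss_t = sum((oi - mean_o) ** 2 for oi in o)
    return 1.0 - ss_e / ss_t


def vaf(e, o):
    """Variance accounted for: 1 - var(o - e) / var(o)."""
    residuals = [oi - ei for ei, oi in zip(e, o)]
    return 1.0 - variance(residuals) / variance(o)


# Usage: the estimates are off by a constant offset of 1.
o = [1.0, 2.0, 3.0, 4.0]
e = [2.0, 3.0, 4.0, 5.0]
results = (mse(e, o), r2(e, o), vaf(e, o))
```

For this example the MSE is 1 and R² drops to 0.2, whereas VAF is still 1
because the residuals have no variance at all.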


In the implementations of these evaluation functions we have introduced a 
parameter for limiting the maximum contribution of a single sample’s error to 
the total evaluation. The residual of each specific sample can so be limited in 
relation to the original target values’ range; this is supposed to help to cope 
with outliers and invalid values calculated by division by 0, for example. 


9.2.5.2 Combined Solution Evaluation 


Several advanced evaluation concepts are also realized in an advanced eval- 
uation operator for HeuristicLab. Again, for the explanations given in this 
section let o be the original and e the estimated target values, and N the 
number of samples analyzed; furthermore, let range(o) be the range of the 
original target values: 


$$ range(o) = \max(o) - \min(o) \tag{9.11} $$


First, instead of mean squared errors we use the mean exponentiated error 
function; the residuals are raised to the power of n, a parameter of this par- 
ticular evaluation function, and the mean value of these exponentiated errors 
is calculated: 


$$ MEE(o, e, n) = \frac{1}{N} \sum_{i=1}^{N} |e_i - o_i|^n; \quad N = |e| = |o| \tag{9.12} $$


Additionally, this operator is able to combine the evaluation functions given 
in the previous section; a combined fitness value is calculated as a linear 
combination of the three separate fitness values. 

First, the fitness values MEE(o,e,n), R²(o,e), and VAF(o,e) have to be
scaled so that they have comparable ranges. The exponentiated errors are
scaled by dividing them by a fourth of the target values’ range; so, for
calculating the scaled fitness value $MEE'(o,e,n)$, MEE(o,e,n) is divided by a
fourth of the target data’s range raised to the power of n since

$$ MEE'(o, e, n) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{|e_i - o_i|}{range(o)/4} \right)^n; \quad N = |e| = |o| \tag{9.13} $$

$$ MEE'(o, e, n) = \frac{1}{N} \left( \frac{4}{range(o)} \right)^n \sum_{i=1}^{N} |e_i - o_i|^n \tag{9.14} $$

$$ MEE'(o, e, n) = MEE(o, e, n) \cdot \left( \frac{4}{range(o)} \right)^n \tag{9.15} $$

where n is the exponent chosen for raising the errors to the power of n.
The scaled values $R^{2\prime}(o,e)$ and $VAF'(o,e)$ are calculated simply as

$$ R^{2\prime}(o, e) = 1 - R^2(o, e), \quad VAF'(o, e) = 1 - VAF(o, e) \tag{9.16} $$

since the range of the R² and VAF functions is [0,1], anyway.

The minimum and maximum residuals $r_{min}(o,e)$ and $r_{max}(o,e)$ can also
be considered; before using them in the combined fitness function, they are
scaled in the same way as the MEE values:

$$ r = e - o; \quad r_{min}(o, e) = \min(r), \quad r_{max}(o, e) = \max(r) \tag{9.17} $$

$$ r'_{min}(o, e) = \frac{|r_{min}(o, e)| \cdot 4}{range(o)}, \quad r'_{max}(o, e) = \frac{|r_{max}(o, e)| \cdot 4}{range(o)} \tag{9.18} $$


All these scaled partial fitness contribution values are multiplied with
coefficients $c_1, c_2, c_3, c_4, c_5$, summed, and the result divided by the sum of
the coefficients; the result is returned as the combined fitness value
COMB(o, e, n, c):

$$ a_1 = c_1 \cdot MEE'(o, e, n) \tag{9.19} $$
$$ a_2 = c_2 \cdot R^{2\prime}(o, e) \tag{9.20} $$
$$ a_3 = c_3 \cdot VAF'(o, e) \tag{9.21} $$
$$ a_4 = c_4 \cdot r'_{min}(o, e) \tag{9.22} $$
$$ a_5 = c_5 \cdot r'_{max}(o, e) \tag{9.23} $$

$$ COMB(o, e, n, c) = \frac{a_1 + a_2 + a_3 + a_4 + a_5}{c_1 + c_2 + c_3 + c_4 + c_5} \tag{9.24} $$
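The combined evaluation can be sketched in Python as follows. This is an
illustration under stated assumptions and not the HeuristicLab operator: the
function name combined_fitness is our own, the scaled R² and VAF
contributions are taken as 1 - R² and 1 - VAF, and we assume that the minimum
and maximum residuals enter with their absolute values before being scaled by
a fourth of the target values’ range.

```python
def combined_fitness(o, e, n, c):
    """Weighted combination of scaled MEE, R^2, VAF and extreme residuals.

    o, e: original and estimated target values; n: error exponent;
    c: the five weights (c1, ..., c5). Smaller results are better.
    """
    N = len(o)
    scale = 4.0 / (max(o) - min(o))            # division by range(o) / 4
    r = [ei - oi for ei, oi in zip(e, o)]      # residuals r = e - o

    # Mean exponentiated error, scaled by (4 / range(o))^n.
    mee_scaled = sum(abs(ri) ** n for ri in r) / N * scale ** n

    mean_o = sum(o) / N
    ss_t = sum((oi - mean_o) ** 2 for oi in o)
    r2_scaled = sum(ri ** 2 for ri in r) / ss_t          # equals 1 - R^2

    mean_r = sum(r) / N
    var_r = sum((ri - mean_r) ** 2 for ri in r) / N
    vaf_scaled = var_r / (ss_t / N)                      # equals 1 - VAF

    # Assumption: the extreme residuals are used with their absolute values.
    rmin_scaled = abs(min(r)) * scale
    rmax_scaled = abs(max(r)) * scale

    parts = (mee_scaled, r2_scaled, vaf_scaled, rmin_scaled, rmax_scaled)
    return sum(ci * ai for ci, ai in zip(c, parts)) / sum(c)


# Usage: constant offset of 1 on a target range of 4, all weights equal.
fitness = combined_fitness([0.0, 4.0], [1.0, 5.0], 2, (1, 1, 1, 1, 1))
```

Setting a weight to 0 switches the respective partial evaluation off, so the
same function covers the pure (scaled) MEE, R², or VAF evaluation as special
cases.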


There are, in fact, even more sophisticated evaluation operators to be
described, namely a time series analysis specific one as well as a classification
specific one. These are discussed in Sections 11.1 and 11.2.


9.2.5.3 Adjusted Solution Evaluation 


A modification of the coefficient of determination function R² is the
so-called adjusted R²; when evaluating a model m, this extension of the R²
function described above also takes into account the number of explanatory
terms of the model. Let N be the sample size, t the number of terms in m,
and o and e again the original and estimated target values; $R^2_{adj}(o,e)$ is
then calculated as

$$ R^2_{adj}(o, e) = 1 - (1 - R^2(o, e)) \cdot \frac{N - 1}{N - t - 1} \tag{9.25} $$

This add-on⁷ increases the calculated quality value only if the addition of
a new term to the model improves the model’s performance more than what
would be expected by chance; unlike R² it can even become a negative value.

We have adapted this concept in a slightly modified manner so that it is
applicable to the partial R² and VAF evaluations of the combined evaluator
COMB described in Section 9.2.5.2. These partial evaluation results can
be optionally corrected using the factor $\frac{N-1}{N-s-1}$ where s is the model’s size,
i.e., the number of nodes of the structure tree solution representing the model
which is to be evaluated. So, the adjusted evaluation results $R^2_{adj}$ and $VAF_{adj}$
are calculated as

$$ q = \frac{N - 1}{N - s - 1} \tag{9.26} $$
$$ R^2_{adj}(o, e) = 1 - (1 - R^2(o, e)) \cdot q \tag{9.27} $$
$$ VAF_{adj}(o, e) = 1 - (1 - VAF(o, e)) \cdot q \tag{9.28} $$
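A short Python sketch shows the effect of such a correction. The function name
adjusted is our own; the factor (N - 1) / (N - k - 1) is the one of the
standard adjusted R², and we assume here that the size-based variant has the
same form, with the number of terms replaced by the number of tree nodes.

```python
def adjusted(score, n_samples, k):
    """Correct an R^2 or VAF value for the model's number of terms or nodes k.

    Uses the factor (N - 1) / (N - k - 1) of the standard adjusted R^2;
    applying the same form to the tree size is an assumption of this sketch.
    """
    if n_samples - k - 1 <= 0:
        raise ValueError("need more samples than terms plus one")
    q = (n_samples - 1) / (n_samples - k - 1)
    return 1.0 - (1.0 - score) * q


# Usage: the same raw score is rated lower for the more complex model.
small_model = adjusted(0.9, 101, 5)
large_model = adjusted(0.9, 101, 50)
```

The larger k becomes in relation to the number of samples, the stronger the
penalty; for poor models with many terms the result can become negative.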


9.2.5.4 Runtime Consumption Considerations 


As we have now described all basic genetic operators for data-based system 
identification using genetic programming, we can try to estimate their relative 
runtime consumption. 

The initialization of structure trees is not only called just once, it is also
relatively cheap in terms of runtime consumption. This is because nodes, which
are relatively small entities, are created according to the rules and limitations
given in the problem instance and the functions basis; the connection between
nodes is established by references (pointers) from parent to child nodes.

Crossover and mutation are in our case also very inexpensive with respect 
to runtime and memory consumption. Nodes and references are copied and 
parameters are modified; only in case of the creation of invalid structure trees 
it could be that repair routines have to be used which could, if implemented 
in a suboptimal way, cost significant runtime. 

Anyway, it boils down to the fact that most of the runtime of a GP-based
system identification process is consumed by the evaluation of solution
candidates. This is because models have to be evaluated on the training (and
maybe also validation) data, i.e., on possibly hundreds or thousands of
samples. Collecting these values and then calculating the fitness value can again
be relatively cheap (with respect to runtime consumption) when using rather
simple evaluation functions such as those summarized in Sections 9.2.5.1, 9.2.5.2,
and 9.2.5.3; still, especially when using more complex functions, for example
the time series analysis or classification specific ones given in Sections 11.1
and 11.2, this part of the evaluation might also cause noticeable runtime
consumption.


⁷ Of course, calling this modification an “add-on” may sound a bit misleading as it is not an
additive but rather a multiplicative one. The reader is asked to be so kind as to forgive this
slight rhetorical incorrectness.

In HeuristicLab, for instance, we have measured that even when using a
graphical user interface with results display and solution protocolling, more
than 99.5% of the algorithm’s runtime is consumed by evaluation operators.


9.2.5.5 Early Stopping of Model Evaluation 


So, what can we do to fight this problem of high computational costs of 
GP-based structure identification? The simplest answer would be to decrease 
the size of the training (and validation) data partitions. Of course this is not 
a generally applicable way to do this; training data should include as much 
information as possible in a preferably efficient way - it should be as small as 
possible, but at the same time also as extensive as necessary. 

Sampling, i.e., evaluating the models not on the total training / validation 
data sets but only on certain selected samples seems to be a better idea: 
By only evaluating the models for a number of (at best randomly) selected 
sample indices, the total quality is estimated. This on the one hand surely 
decreases runtime consumption and on the other hand also might help to 
avoid overfitting as the models are evaluated on different samples at each 
evaluation step (so that they cannot be fit too closely to a set of samples). 
Still, the quality measurement might thus become somewhat unstable; a model
might be assigned completely different quality values each time it is evaluated
because the samples chosen are likely to differ.

When using offspring selection as described in Chapter 4 there is even a 
possibility of how to speed up the evaluation without decreasing the quality 
of the fitness estimation method: 

During the offspring selection phase, solution candidates are compared to 
their parents, i.e., their quality values are compared to their parents’ fitness 
values. In the case of applying most restrictive settings, i.e., when the success 
ratio is set to 1.0, then models are inserted into the next generation’s popula- 
tion only if they fulfill the quality requirements given by the parent’s quality 
values and the comparison factor; there is no pool of possible lucky losers, 
solution candidates that do not fulfill the given fitness criterion are discarded. 
In this case the evaluation of a model can be aborted as soon as it is clear 
that the fitness value will surely not satisfy the fitness criterion even if the 
rest of the evaluation produces no additional errors. 

The issue, then, is how to detect when the evaluation of a model can be
aborted without decreasing the quality estimation’s accuracy with respect to
the total GP process. We introduce a relative calculation interval size rcis
which is a value in the interval [0,1] (normally a value such as 0.1, 0.2,
or 0.5) used in the following way:

Let m be a model which is to be evaluated for a system identification
problem p; furthermore let $p_1$ and $p_2$ be the parents of m, and $q_{p_1}$ and $q_{p_2}$
their respective quality values. The given comparison factor cf is then used for
calculating the comparison value cv depending on whether p is a maximization
or a minimization problem:

$$ q_{min} = \min(q_{p_1}, q_{p_2}); \quad q_{max} = \max(q_{p_1}, q_{p_2}) \tag{9.29} $$
$$ q_{range} = |q_{p_1} - q_{p_2}| \tag{9.30} $$

$$ cv = \begin{cases} q_{min} + q_{range} \cdot cf & : isMaximizationProblem(p) \\ q_{max} - q_{range} \cdot cf & : isMinimizationProblem(p) \end{cases} \tag{9.31} $$


In system identification we normally deal with minimization problems when
using the MSE, MEE, or COMB evaluation operator as smaller fitness
values are favored; when using the R² or VAF operator, p can be considered
a maximization problem since better models are assigned higher fitness values.

Let us now assume that N samples are to be evaluated; $\mathbf{o}$ are then
the N original target values, and the calculation samples interval csi is
calculated as

$$ csi = N \cdot rcis \tag{9.32} $$

The vector of estimated target values $\mathbf{e}$ is initialized as a copy of
$\mathbf{o}$; the model’s quality $q_m$ is initially set to the worst possible fitness
value (−maxVal for maximization, maxVal for minimization problems), and
the indices $i_1$ and $i_2$ are set to 1.⁸

The following evaluation steps are executed and repeated as long as the
intermediate value of $q_m$ is “better” than cv (i.e., smaller if p is a
minimization and greater if p is a maximization problem):


1. The index $i_2$ is set to $i_1 + csi - 1$; if $i_2 > N$, then $i_2 := N$.


2. The estimated values $e_j$ are calculated for $j = [i_1 \ldots i_2]$; these replace
the values at the respective indices in $\mathbf{e}$ so that

$$ \mathbf{e} = [e_1, \ldots, e_{i_2}, o_{i_2+1}, \ldots, o_N] \tag{9.33} $$

3. $q_m$ is calculated using the given fitness function f:

$$ q_m = f(\mathbf{o}, \mathbf{e}) \tag{9.34} $$


4. Now there are several ways how the evaluation is continued:


(a) If $i_2$ is equal to N, i.e., if all samples have been considered, then
m is assigned the fitness value $q_m$.


(b) Otherwise, if $q_m$ is no longer “better” than cv, then the evaluation
of m can be aborted and m can be assigned the worst possible
fitness value (−maxVal for maximization, maxVal for minimization
problems).
As an alternative, we can also assign m an extrapolated fitness
value: If p is a minimization problem and the optimal possible
fitness value is 0, as is the case if we use the MSE, MEE, or
COMB operator, then we can assign m the extrapolated fitness
value $q_m \cdot \frac{N}{i_2}$.


(c) Otherwise, $i_1$ is set to $i_2 + 1$ and the evaluation of m is continued
at step 1.


⁸ In this description we again use one-based indexing; in most modern programming
languages such as C, C++, Java, or C#, zero-based indexing would be used instead.


By rearranging the evaluation as described above we guarantee that the 
quality of models that fulfill the given offspring selection criterion is accurate 
and calculated in the same way as when using the standard procedure. For 
models that perform worse than demanded and are therefore not about to 
fulfill the offspring selection criterion, the evaluation is aborted as soon as it 
is clear that the evaluation will result in such a “bad” fitness value. 

Thus, a lot of runtime can be saved. For the sake of completeness we of
course have to admit that the runtime consumption is increased slightly for
models that are evaluated on all samples since intermediate fitness values are
calculated; still, this minor drawback is accepted as the advantages outweigh
it by far.
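The procedure described in this section can be sketched in Python as follows.
This is a simplified illustration and not the HeuristicLab implementation: the
function names are our own, the model is assumed to be a callable that returns
the estimated value for a (zero-based) sample index, infinity is used instead
of maxVal, and the extrapolated fitness variant is omitted.

```python
import math


def mse(o, e):
    return sum((ei - oi) ** 2 for ei, oi in zip(e, o)) / len(o)


def comparison_value(q_p1, q_p2, cf, maximization):
    """Comparison value cv derived from the parents' qualities."""
    q_min, q_max = min(q_p1, q_p2), max(q_p1, q_p2)
    q_range = abs(q_p1 - q_p2)
    return q_min + q_range * cf if maximization else q_max - q_range * cf


def evaluate_with_early_stopping(model, o, fitness, cv, rcis, maximization=False):
    """Return (fitness value, number of samples actually evaluated)."""
    n = len(o)
    csi = max(1, int(n * rcis))      # calculation samples interval
    e = list(o)                      # unevaluated samples cause no error yet
    worst = -math.inf if maximization else math.inf
    i1 = 0
    while True:
        i2 = min(i1 + csi, n)        # exclusive upper bound of this interval
        for j in range(i1, i2):
            e[j] = model(j)
        q_m = fitness(o, e)          # intermediate (optimistic) fitness
        if i2 == n:
            return q_m, n            # all samples considered: exact value
        still_better = q_m > cv if maximization else q_m < cv
        if not still_better:
            return worst, i2         # criterion cannot be met anymore: abort
        i1 = i2


# Usage: a poor model is discarded after the first interval of 2 samples.
targets = [0.0] * 10
aborted = evaluate_with_early_stopping(lambda j: 10.0, targets, mse, 1.0, 0.2)
complete = evaluate_with_early_stopping(lambda j: 0.1, targets, mse, 1.0, 0.2)
```

Aborting is safe for error-based functions such as the MSE because each
additional interval can only add errors; the intermediate value is therefore
an optimistic estimate of the final fitness.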


9.3 Local Adaption Embedded in Global Optimization 


Genetic algorithms and genetic programming are in general global opti- 
mization methods, i.e., their aim is to search the whole search space in an 
intelligent way in order to find the (or an) optimal solution. In contrast to 
this, local optimization methods are local search algorithms, which means that 
they move from solution to solution and so search the search space until a so- 
lution considered optimal is found (or a time-out condition is fulfilled). Well 
known examples for local search algorithms are the hill climbing algorithm 
and tabu search; please see [RN03] and [GL97] for respective explanations 
and discussions. 

In biology, an organism’s positive characteristic that has been favored by
natural selection is called adaption [SG99]. This is, in fact, the central concept
in evolutionary biology and of course also in evolutionary computation.

In this section we shall summarize local adaptation concepts we have
introduced into the genetic programming process, namely parameter optimization
as well as model structure pruning.


9.3.1 Parameter Optimization 


Parameter estimation has already been mentioned in connection with clas- 
sical system identification: After determining and fixing the structure of the 
model, appropriate parameters have to be estimated on the basis of empirical 
data. 

In GP, the genetic process is supposed to identify the set of relevant vari- 
ables, the formula structure, and appropriate parameters automatically; there 
are no explicit parameter estimation phases planned in the standard GP pro- 
cess. Furthermore, GP is very flexible regarding function and terminal defi- 
nitions as well as formula structures; it is not easy to formulate general pa- 
rameter optimization methods for arbitrary nonlinear model structures. 

Still, in GP we have to face the problem that often models with good 
structures are assigned bad fitness values due to disadvantageous parameters 
such as coefficients or time lags. This is the reason why we have implemented 
a parameter optimization method based on evolution strategy (ES) concepts. 


Evolution strategy is an optimization technique whose ideas are based on
the natural concepts of evolution and adaption; it has been developed
since the 1960s, primarily by a German research community around
Rechenberg and Schwefel ([Rec73], [Sch75], [Sch94]). As it is an
evolutionary algorithm, the optimization process based on ES is executed by applying
operators in a loop, i.e., main operations are applied on the solution
candidates repeatedly until a given termination criterion is met. A comprehensive
overview of the theory of ES can for example be found in [Bey01].

There are several similarities of evolution strategies and genetic algorithms 
or genetic programming; as they are all optimization methods based on evo- 
lution, they are also considered the main representatives of evolutionary al- 
gorithms (EAs). Still, there are some important differences of ESs and GAs, 
the most important being as follows: 


• Solution candidates are in ESs represented as vectors of real-valued
parameters.


• The main factors that drive evolution in ESs are mutation and selection.
Whereas GAs use mutation only for avoiding stagnation, mutation is the
main reproduction operator in evolution strategies: Each component of
the parameter vector is mutated individually in each generation. An
additive mutation is carried out, and small mutations are more likely
than big ones.


• In addition to mutation, recombination can be used to create new
individuals out of two parents, too.


• In contrast to nature and GAs, the selection of ESs works in a totally
deterministic way: In each generation only the best individuals survive,
whereas in GAs better individuals (normally) just have a higher likelihood
to be considered for producing new solution candidates.


In each generation of the execution of an ES, λ individuals (children) are
(by mutation and optional recombination) created out of μ individuals of the
current population. Depending on the chosen strategy, the μ members of the
new generation’s population are selected from all μ + λ candidates (which is
referred to as the (μ + λ)-ES) or only from the λ children for λ >> μ (which is
also called the (μ, λ)-ES model). This procedure is repeated until a termination
criterion is reached, normally a maximum number of iterations or a state in
which no more improvement can be reached.
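The difference between the two selection schemes can be shown with a few lines
of Python. The function select_survivors and its parameters are names chosen
for this sketch; individuals are compared via a fitness function, and smaller
fitness values are assumed to be better.

```python
def select_survivors(parents, children, mu, plus_strategy, fitness):
    """Deterministic ES selection of the mu best candidates.

    plus_strategy=True selects from parents and children ((mu + lambda)-ES),
    plus_strategy=False only from the children ((mu, lambda)-ES).
    """
    pool = list(children)
    if plus_strategy:
        pool += list(parents)
    if len(pool) < mu:
        raise ValueError("not enough candidates to select mu survivors")
    return sorted(pool, key=fitness)[:mu]


# Usage: individuals are represented directly by their fitness values here.
parents = [1.0, 5.0]
children = [2.0, 3.0, 4.0]
plus = select_survivors(parents, children, 2, True, lambda x: x)
comma = select_survivors(parents, children, 2, False, lambda x: x)
```

In the plus variant the best parent survives, whereas in the comma variant
it is lost although none of the children is better; this is why the comma
strategy needs a sufficiently large number of children.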

In Section 9.1.2 we have shown the general form of a polynomial model 
which is characterized by its order and coefficients: 


$$ y = a_0 + a_1 x + a_2 x^2 + \ldots + a_n x^n \tag{9.35} $$


In this case, the optimization of the model’s parameters is the task of finding
appropriate coefficients $a_0 \ldots a_n$. In the much more general point of view in
our GP-based approach, the parameters of a model contain a lot more; in
fact, all parameter settings of the terminal nodes included in the model are
also parameters for the formula which can be optimized without changing the
model’s structure.

For each terminal used in our GP approach, the following parameters are 
to be considered: 


• The variable index, i.e., the number of the variable which is referenced.


• The coefficient, a value which can be used for multiplying the
referenced variable’s value with a given constant; this constant can be either
real-valued or integral, and its distribution either uniform (defined by
minimum and maximum) or Gaussian (defined by mean and variance).


• The time offset, a value which can be used for referencing the
variable’s values shifted by a certain number of samples.


Thus, when it comes to optimizing a model m containing t terminal nodes, we
have to consider 3·t parameters that could be manipulated by the optimization
method.

As mutation is (besides selection) the most important factor in ES, we
shall now discuss how mutation with respect to a model’s parameters can be
applied. A parameter σ is used for controlling the strength of mutation; we
here see σ simply as the standard deviation of the modification added to the
model’s parameter values. Thus, each of the model’s parameters is
modified, where again smaller modifications are more likely than bigger ones;
variable index changes are also to be applied rather seldom (for 20% of the
terminals, e.g.).

So, the whole parameter optimization procedure we have implemented for
optimizing a given model m using the parameters λ, σ, $it_{max}$, and $cf_{max}$ is
executed in the following way:


1. Collect all terminals of m in t.

2. Create λ copies of m, in the following called mutants.

3. Mutate all λ mutants individually; for each terminal of the mutant
models

   • mutate the coefficient,
   • mutate the time offset, and
   • with a rather small probability mutate the variable index.

4. Evaluate all λ mutants.

5. Optionally adjust σ, the parameter steering the mutation’s variance,
according to Rechenberg’s success rule.

6. If any of the mutants is assigned a better quality value than m, then m
is replaced by the best mutant, and

   • if the number of iterations has reached a given limit ($it_{max}$), the
   algorithm is terminated and m is returned as the optimized version
   of the originally given formula;

   • otherwise, the procedure is repeated starting again at step 1.

7. Otherwise, we consider this iteration a failure. If a predefined number
of consecutive failures $cf_{max}$ is reached, i.e., if the procedure has been
unsuccessful $cf_{max}$ times in a row, the algorithm is terminated; otherwise the
procedure is repeated starting again at step 1.


As we here always work on one particular model which is to be optimized
and create λ mutants, this algorithm can be seen as a variant of the (1+λ)-ES
algorithm.
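A minimal Python sketch of this procedure is given below. It is a simplified
illustration under several assumptions and not the HeuristicLab
implementation: the model is reduced to a list of terminal parameter
dictionaries, evaluate is a user-supplied function returning a fitness value
to be minimized, time offsets are mutated additively and clamped to assumed
bounds, the optional adaptation of σ is omitted, and the iteration limit is
applied to all iterations rather than only to successful ones.

```python
import random


def optimize_parameters(terminals, evaluate, lam, sigma, it_max, cf_max,
                        n_variables, p_index=0.2, offset_bounds=(-10, 0),
                        rng=random):
    """(1+lambda)-ES style optimization of terminal parameters (minimization)."""
    lo, hi = offset_bounds
    best = [dict(t) for t in terminals]       # work on a copy of the model
    best_q = evaluate(best)
    failures = 0
    for _ in range(it_max):
        mutants = []
        for _ in range(lam):
            mutant = [dict(t) for t in best]
            for t in mutant:
                # Additive Gaussian mutation: small changes are more likely.
                t["coefficient"] += rng.gauss(0.0, sigma)
                shifted = t["offset"] + round(rng.gauss(0.0, sigma))
                t["offset"] = min(max(shifted, lo), hi)
                # The variable index is changed only rarely.
                if rng.random() < p_index:
                    t["variable"] = rng.randrange(n_variables)
            mutants.append((evaluate(mutant), mutant))
        q, candidate = min(mutants, key=lambda pair: pair[0])
        if q < best_q:                        # success: best mutant replaces m
            best, best_q, failures = candidate, q, 0
        else:                                 # failure: count consecutive ones
            failures += 1
            if failures >= cf_max:
                break
    return best, best_q


# Usage: a toy quality function whose optimum is a coefficient of 3.0.
start = [{"coefficient": 0.0, "offset": 0, "variable": 0}]
quality = lambda ts: (ts[0]["coefficient"] - 3.0) ** 2
best, q = optimize_parameters(start, quality, lam=10, sigma=0.5, it_max=200,
                              cf_max=20, n_variables=1, rng=random.Random(1))
```

Each iteration costs λ model evaluations, which makes the runtime impact of
the parameters λ, $it_{max}$, and $cf_{max}$ directly visible in the loop structure.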

Obviously, the main advantage of this algorithm is that it can be applied
to any kind of model without any restrictions regarding its structure or the
given data basis. But, of course the major drawback of this procedure is
its immense runtime consumption due to the high number of models that
have to be evaluated for improving the parameters of one single model of the
GP population. The use of a smaller data set (or the validation set which is
normally also smaller than the training data sample) for evaluating the models
can help to fight this problem, but still the use of this parameter optimization
concept has to be thought out well and the parameters (σ, λ, $it_{max}$, and
$cf_{max}$) set so that the runtime consumption does not get completely out
of hand. As we will show in the test series analysis in later sections, this
parameter optimization method does not have to be applied in every round of
the GP process, and also not to all models in the population; partial use can
help to control the additional runtime consumption and still use the significant
benefits of this procedure.


9.3.2 Pruning 
9.3.2.1 Basics and Method Parameters 


Whenever gardeners and orchardists talk about pruning, then they most 
probably refer to the act of cutting out dead, diseased, or for any other reason 
unwanted branches of trees or shrubs. Even though this might harm the 
natural form of plants, pruning is supposed to improve the plants’ health in 
the long run. 

In informatics and especially machine learning, this term is used in analogy 
to describe the act of systematically removing parts of decision trees; regres- 
sion or classification accuracy is decreased intentionally and thus traded for 
simplicity and better generalization of the model evolved. Approaches and 
benefits of the techniques used can be found for example in [Min89], [BB94], 
[HS95], or [Man97]. 

Obviously, the concept of removing branches of a tree can be easily trans- 
ferred to GP, especially when we deal with tree-based genetic programming. 
Several pruning operators have already been presented for GP, see for example
[ZM96], [FP98], [MK00], or more recent publications such as [dN06], [EKK04],
[DH02], [FPS06], [GAT06]. In GP, pruning is often considered valuable
because it helps to find more general and not over-parameterized programs; it is
also referred to as an appropriate anti-bloat technique as described in Section
2.6 or [LP02], e.g.

In the case of fixed functions bases, pruning can also include the detection of
really ineffective code or introns, i.e., code fragments that do not contribute to
the program’s (or, as in our case, model’s) evaluation. For example, simply by
using basic algebraic analysis, a simplification mechanism for formulas would
be able to detect that -(+(x;4);4) is equal to +(+(x;4);-4) due to basic
knowledge about subtraction and addition, and that this is again equal to
+(x;4;-4). This then can be easily simplified to x as it is easy to implement
a simplification program “knowing” that the sum of any value and its
negative counterpart (here 4 and -4) is always 0, and that 0 is the neutral
element of the addition function.

But, as soon as such a fixed functions basis is not available anymore, things
start to become a lot more complicated. We shall here describe pruning
methods suitable for use in combination with a flexible and parameterizable
set of function and terminal definitions as described in Section 9.2.3. We
hereby try to consider the gain of simplicity as well as the deterioration of the
model’s quality caused by pruning it:


• The gain of simplicity with respect to the pruning of a model can be
calculated by comparing its original tree complexity and the complexity
of the pruned structure tree. The complexity of a model m, c(m), can
hereby be equal to the size or the height of the tree structure representing
m.

So, we calculate the model complexity decrease, $mcd(m, m_p)$, of a model
m and a pruned version of m, $m_p$, as

$$ mcd(m, m_p) = \frac{c(m)}{c(m_p)} \tag{9.36} $$

Pruning a model by deleting subtrees will therefore always result in an
mcd value equal to or greater than 1 as the original model’s complexity
(in terms of size or height of the tree structure) will always be greater
than or equal to the pruned model’s complexity.


• The deterioration of a model caused by pruning, $deter(m, m_p)$, can be
  measured by calculating the ratio of the pruned model's quality $q(m_p)$
  and the quality of the original formula $q(m)$ as

  $$deter(m, m_p) = \frac{q(m_p)}{q(m)} \tag{9.37}$$

  Thus, if for example the pruned model's fitness value is 10% higher,
  i.e., worse than the original model's quality with respect to a given
  evaluation operator, then the resulting deterioration coefficient will be
  equal to 1.1.

  Please note that this approach yields reasonable results only when using
  a minimization approach, i.e., if better models are assigned smaller
  quality values as is the case with the $MSE$ function, for example. If the
  evaluation operator applied behaves reciprocally, i.e., if for example the
  $R^2$ or $VAF$ function is used, then the reciprocal value of
  $deter(m, m_p)$, $\frac{1}{deter(m, m_p)}$, is to be used instead.


These measures for the effect of pruning, namely the complexity reduction
as well as the quality deterioration, are now used for parameterizing the
effective pruning of models:

As we have stated above, accuracy is traded for simplicity, and now we
are able to quantify this trade-off. By giving an upper bound for the
ratio between the coefficients expressing the quality deterioration and
the simplification effects, the pruning mechanism can be limited; we call this
composed coefficient $cp(m, m_p)$ and limit it with the upper bound
$cp_{max}$, demanding that

$$cp(m, m_p) = \frac{deter(m, m_p)}{mcd(m, m_p)} \leq cp_{max} \tag{9.38}$$

Thus, we demand that the decrease of the model's quality shall not
be worse than the simplicity gain multiplied by a certain factor $cp_{max}$.

Still, there is one major problem with this approach: a tremendous loss of
quality, for example an increase of the mean squared error by a factor of
50, might be compensated by replacing a formula $m_1$ consisting of 60 nodes
by one single constant, i.e., a model $m_2$ with only one node:

$$cp(m_1, m_2) = \frac{deter(m_1, m_2)}{mcd(m_1, m_2)} = \frac{q(m_2) / q(m_1)}{c(m_1) / c(m_2)} = \frac{50}{60 / 1} = \frac{50}{60} < 1 \tag{9.39}$$


So, in order to cope with this potential problem (it is in fact really a
problem since we do not want to replace all models with constant terminals),
we give a second parameter for the pruning method which limits the quality
deterioration, $det_{max}$, and so demand that

$$deter(m, m_p) = \frac{q(m_p)}{q(m)} \leq det_{max} \Leftrightarrow q(m_p) \leq det_{max} \cdot q(m) \tag{9.40}$$
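The two acceptance criteria can be sketched in Python as follows. This is a minimal illustration of Equations 9.36 to 9.40 under our own naming (the functions mcd, deter, and accept_pruning are hypothetical helpers, not HeuristicLab operators); the flag minimize covers the reciprocal case mentioned above for quality functions that are to be maximized.

```python
def mcd(c_original, c_pruned):
    """Model complexity decrease, Equation 9.36."""
    return c_original / c_pruned


def deter(q_original, q_pruned, minimize=True):
    """Quality deterioration, Equation 9.37 (reciprocal if maximizing)."""
    ratio = q_pruned / q_original
    return ratio if minimize else 1.0 / ratio


def accept_pruning(c_original, c_pruned, q_original, q_pruned,
                   cp_max, det_max, minimize=True):
    """Check the bounds of Equations 9.38 and 9.40 for one pruned model."""
    d = deter(q_original, q_pruned, minimize)
    cp = d / mcd(c_original, c_pruned)
    return d <= det_max and cp <= cp_max


# The example of Equation 9.39: 60 nodes replaced by one constant while the
# mean squared error grows by a factor of 50, so cp = 50/60 < 1.
print(accept_pruning(60, 1, 1.0, 50.0, cp_max=1.0, det_max=1.5))
```

With the bound on $cp$ alone the constant model would be accepted; the additional bound on the deterioration rejects it.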


9.3.2.2 Pruning a Structure Tree 


The actual pruning of a model (with respect to one particular part of the 
model) in GP is rather easy as it simply consists of removing a sub-tree from 
the tree structure representing the formula. In the case of pruning the root 
node the model thereafter is simply a terminal representing the constant 0; 
otherwise the resected subtree is to be replaced by a constant representing 
the respective parent’s neutral element for the respective input index. For 
example, pruning inputs of an addition results in the replacement of these 
branches by zeros, whereas children of multiplication functions have to be 
replaced by constants representing 1.0. 

Furthermore, pruning could also include the excision of certain parts of the
model, i.e., a part of a tree could simply be cut out and replaced by one of
its descendants.

Simple examples are shown in Figure 9.14: In the left part (a) we schemat- 
ically show the replacement of the second input of an addition resulting in 
the insertion of the constant 0, in the middle (b) we see the replacement of 
a multiplication’s first input by the constant 1, and in the right part (c) we 
see possible effects of excising two nodes and replacing them by either of their 
two descendants. 
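Both operations, the replacement of a branch by its parent's neutral element and the excision of a node, can be sketched in Python. The representation is our own assumption for illustration: a formula is a nested list [function, input 1, input 2], a node is addressed by its path of list indices, and only addition and multiplication are covered; all function names are hypothetical.

```python
import copy

# Neutral elements of the parent function (assumption: the same value for
# every input index, which holds for + and * but not for - or /).
NEUTRAL = {"+": 0.0, "*": 1.0}


def get_at(tree, path):
    """Return the node reached by following the list indices in `path`."""
    for index in path:
        tree = tree[index]
    return tree


def replace_at(tree, path, new_subtree):
    """Return a copy of `tree` in which the node at `path` is replaced."""
    if not path:
        return new_subtree
    clone = copy.deepcopy(tree)
    get_at(clone, path[:-1])[path[-1]] = new_subtree
    return clone


def prune_branch(tree, path):
    """Remove the branch at `path`; pruning the root yields the constant 0."""
    if not path:
        return 0.0
    parent = get_at(tree, path[:-1])
    return replace_at(tree, path, NEUTRAL[parent[0]])


def excise(tree, path, child):
    """Cut out the node at `path` and put its input number `child` there."""
    node = get_at(tree, path)
    return replace_at(tree, path, copy.deepcopy(node[child]))


formula = ["+", "x1", ["*", "x2", "x3"]]
print(prune_branch(formula, (2,)))      # second input of the addition
print(prune_branch(formula, (2, 1)))    # first input of the multiplication
print(excise(formula, (2,), 1))         # multiplication replaced by x2
```

The original formula is never modified; each operation returns a pruned copy, which matches the way the pruning methods below work on clones of a model.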

So, as we now know how models are pruned in general as well as what we
want a pruning method to achieve, we will describe two pruning methods we
have designed and implemented as operators for HeuristicLab: The first one
is an exhaustive implementation that systematically tries to prune the model
as much as possible, whereas the second one is inspired by evolution
strategies and designed to reduce runtime.

FIGURE 9.14: Simple examples for pruning in GP.


9.3.2.3 Exhaustive Pruning 


When applying exhaustive pruning to a given model $m$ we proceed in the
following way: For each possible subtree up to a given height $h_1$ we
create a copy of $m$ and remove the respective branch. Furthermore, for each
internal model fragment (tree) up to a given height $h_2$ we create a copy of
$m$ and cut out the respective fragment. After doing so, the resulting pruned
models' qualities are calculated and their complexities are checked; if a
pruned model meets the requirements regarding maximum deterioration and
maximum coefficient of simplification and deterioration, then we go on with
the procedure using this pruned formula. This routine is repeated until no
further pruned model that meets the given requirements can be produced by
deleting branches.

Finally, the algorithm's result is either the minimal model meeting the
given requirements, or the model for which the minimal $cp$ coefficient is
calculated. This decision is controlled by the parameter minimizeModel: if
it is set, the minimal formula is returned; if this flag is set to false,
the model with the minimal $cp$ value is considered the result of pruning $m$.

In a bit more formal way we can describe this exhaustive pruning algorithm
as is given in Algorithm 9.1.

Algorithm 9.1 Exhaustive pruning of a model m using the parameters h_1,
h_2, minimizeModel, cp_max, and det_max.
  Initialize m_curr as clone of m
  Evaluate m, store calculated fitness in f
  Calculate complexity of m, store result in c
  Initialize abort = false
  while not(abort) do
    Initialize set of pruned models M
    Initialize structure tree t as tree representation of m_curr
    for each branch b of t with height(b) ≤ h_1 do
      Initialize m_tmp as clone of m_curr
      Remove b', the corresponding branch to b in m_tmp
      Evaluate m_tmp, store calculated fitness in f_tmp
      Calculate complexity of m_tmp, store result in c_tmp
      Calculate model complexity decrease mcd = c / c_tmp
      Calculate quality deterioration det = f_tmp / f
      if det ≤ det_max ∧ det / mcd ≤ cp_max then
        Insert m_tmp to M
      end if
    end for
    for each internal sub-tree st of t with height(st) ≤ h_2 do
      for each descendant d of st do
        Initialize m_tmp as clone of m_curr
        Replace st', the corresponding part to st in m_tmp, by d
        Evaluate m_tmp, store calculated fitness in f_tmp
        Calculate complexity of m_tmp, store result in c_tmp
        Calculate model complexity decrease mcd = c / c_tmp
        Calculate quality deterioration det = f_tmp / f
        if det ≤ det_max ∧ det / mcd ≤ cp_max then
          Insert m_tmp to M
        end if
      end for
    end for
    if M is empty then
      return m_curr
    else
      if minimizeModel then
        Set m_curr to that model in M with minimum complexity value c_tmp
      else
        Set m_curr to that model in M with minimum cp coefficient det / mcd
      end if
    end if
  end while

Exhaustive pruning is of course an extremely expensive method with respect
to runtime consumption. As an alternative, a general pruning method inspired
by evolution strategies is described in the following section.
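To make the control flow of exhaustive pruning more tangible, the following Python sketch implements a reduced variant. It rests on several assumptions of ours that are not part of the operators described here: formulas are nested lists with the binary functions + and * only, quality is the mean squared error against a target column y (assumed to be greater than 0 for the original model), complexity is the tree size, and only branch removal is covered (no excision of internal fragments, no pruning of the root). All names are hypothetical.

```python
import copy

# Binary functions with their neutral elements (assumption: only + and *).
OPS = {"+": (lambda a, b: a + b, 0.0), "*": (lambda a, b: a * b, 1.0)}


def evaluate(tree, row):
    """Evaluate a formula given as nested lists [op, left, right]."""
    if isinstance(tree, list):
        func, _ = OPS[tree[0]]
        return func(evaluate(tree[1], row), evaluate(tree[2], row))
    return row[tree] if isinstance(tree, str) else tree


def mse(tree, data):
    return sum((evaluate(tree, r) - r["y"]) ** 2 for r in data) / len(data)


def size(tree):
    return 1 + size(tree[1]) + size(tree[2]) if isinstance(tree, list) else 1


def height(tree):
    if isinstance(tree, list):
        return 1 + max(height(tree[1]), height(tree[2]))
    return 1


def prunable_paths(tree, h_1, path=()):
    """Yield index paths of non-constant branches with height <= h_1."""
    if isinstance(tree, list):
        for i in (1, 2):
            child = tree[i]
            if isinstance(child, (list, str)) and height(child) <= h_1:
                yield path + (i,)
            yield from prunable_paths(child, h_1, path + (i,))


def prune(tree, path):
    """Return a copy with the branch at `path` replaced by a neutral element."""
    clone = copy.deepcopy(tree)
    parent = clone
    for i in path[:-1]:
        parent = parent[i]
    parent[path[-1]] = OPS[parent[0]][1]
    return clone


def exhaustive_prune(model, data, h_1, cp_max, det_max, minimize_model=True):
    f, c = mse(model, data), size(model)      # reference values of m
    current = model
    while True:
        accepted = []
        for path in prunable_paths(current, h_1):
            candidate = prune(current, path)
            det = mse(candidate, data) / f
            cp = det / (c / size(candidate))
            if det <= det_max and cp <= cp_max:
                accepted.append((size(candidate), cp, candidate))
        if not accepted:
            return current
        rank = (lambda a: a[0]) if minimize_model else (lambda a: a[1])
        current = min(accepted, key=rank)[2]


data = [{"x1": i, "x2": i % 3, "y": i + 0.5} for i in range(10)]
model = ["+", "x1", ["*", 0.001, "x2"]]
print(exhaustive_prune(model, data, h_1=2, cp_max=1.0, det_max=1.05))
```

In this toy data set the term 0.001 * x2 hardly influences the error, so the sketch removes it, whereas removing x1 would violate the deterioration bound. Constants are never pruned again, which guarantees that the loop terminates.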


9.3.2.4 ES-Inspired Pruning 


As a less runtime-consuming pruning method we have designed an ES-inspired
pruning method: For pruning a model $m$, we create $\lambda$ clones of $m$
and prune those randomly; again, we use parameters $h_1$ and $h_2$ that limit
the size of the branches and internal subtrees that are excised. All of the
$\lambda$ pruned mutants created in this way are checked, and those that
fulfill the given requirements regarding maximum deterioration and maximum
coefficient of simplification and deterioration are collected. This procedure
is then repeated with the best pruned mutant, where the best pruned model is
again selected as the minimal model or the one showing the best coefficient
of simplification and deterioration. As soon as this procedure is executed
without any success for a given number of rounds in a row, the algorithm is
terminated.

Algorithm 9.2 describes this ES-inspired pruning method in a more formal 
way. 
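A reduced Python sketch of this ES-inspired procedure is given below. As before, it is an illustration under our own assumptions rather than the HeuristicLab operator: formulas are nested lists with + and * only, quality is the mean squared error against a column y (greater than 0 for the original model), each mutant is produced by removing one randomly chosen branch (the height limits and the excision of internal subtrees are omitted for brevity), and the best mutant is the one with the minimal $cp$ value. All names are hypothetical.

```python
import copy
import random

# Binary functions with their neutral elements (assumption: only + and *).
OPS = {"+": (lambda a, b: a + b, 0.0), "*": (lambda a, b: a * b, 1.0)}


def evaluate(tree, row):
    if isinstance(tree, list):
        func, _ = OPS[tree[0]]
        return func(evaluate(tree[1], row), evaluate(tree[2], row))
    return row[tree] if isinstance(tree, str) else tree


def mse(tree, data):
    return sum((evaluate(tree, r) - r["y"]) ** 2 for r in data) / len(data)


def size(tree):
    return 1 + size(tree[1]) + size(tree[2]) if isinstance(tree, list) else 1


def branch_paths(tree, path=()):
    """Yield index paths of all non-constant branches below the root."""
    if isinstance(tree, list):
        for i in (1, 2):
            if isinstance(tree[i], (list, str)):
                yield path + (i,)
            yield from branch_paths(tree[i], path + (i,))


def random_prune(tree, rng):
    """Clone `tree` and replace one random branch by a neutral element."""
    paths = list(branch_paths(tree))
    if not paths:
        return None
    path = rng.choice(paths)
    clone = copy.deepcopy(tree)
    parent = clone
    for i in path[:-1]:
        parent = parent[i]
    parent[path[-1]] = OPS[parent[0]][1]
    return clone


def es_prune(model, data, lam, max_unsucc_rounds, cp_max, det_max, rng):
    f, c = mse(model, data), size(model)      # reference values of m
    current, unsuccessful = model, 0
    while unsuccessful < max_unsucc_rounds:
        accepted = []
        for _ in range(lam):                  # create lambda pruned mutants
            mutant = random_prune(current, rng)
            if mutant is None:
                break
            det = mse(mutant, data) / f
            cp = det / (c / size(mutant))
            if det <= det_max and cp <= cp_max:
                accepted.append((cp, mutant))
        if accepted:
            current = min(accepted, key=lambda a: a[0])[1]
            unsuccessful = 0
        else:
            unsuccessful += 1
    return current


data = [{"x1": i, "x2": i % 3, "y": i + 0.5} for i in range(10)]
model = ["+", "x1", ["*", 0.001, "x2"]]
result = es_prune(model, data, lam=5, max_unsucc_rounds=3,
                  cp_max=1.0, det_max=1.05, rng=random.Random(42))
print(result)
```

Whatever mutants the random generator produces, the returned model is either the original one or a mutant that satisfied both bounds, so its deterioration never exceeds $det_{max}$.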


9.4 Similarity Measures for Solution Candidates 


Genetic diversity and population dynamics are very interesting aspects
when it comes to analyzing GP processes. Measuring the entropy of a
population of trees can be done for example by considering the programs'
scores (as explained in [Ros95b], e.g.); entropy is there calculated as
$-\sum_k p_k \cdot \log(p_k)$ (where $p_k$ is the proportion of the
population $P$ occupied by population partition $k$). In [McK00] the
traditional fitness sharing concept from the work described in [DG89] is
applied to test its feasibility in GP.
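The entropy formula can be illustrated with a short Python sketch. We assume here, purely for illustration, that each distinct score value forms one partition of the population; the function name population_entropy is hypothetical.

```python
import math
from collections import Counter


def population_entropy(scores):
    """Entropy -sum_k p_k * log(p_k) over partitions of equal score."""
    counts = Counter(scores)
    total = len(scores)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# A population in which all programs share one score has entropy 0;
# the entropy grows as the scores spread over more partitions.
print(population_entropy([0.3, 0.3, 0.3, 0.3]))
print(population_entropy([0.1, 0.2, 0.3, 0.4]))
```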

In this section we present more sophisticated measures which we have used 
for estimating the genetic diversity in GP populations as well as among pop- 
ulations of multi-population GP applications. What we use as basic measures 
for this are the following two functions that calculate the similarity of GP 
solution candidates or, a bit more specific, in our case formulas represented 
as structure trees: 


• Evaluation-based similarity estimation compares the subtrees of two GP
  formulas with respect to their evaluation on the given training or
  validation data. The more similar these evaluations are with respect to
  the squared errors or linear correlation, the higher is the similarity for
  these two formulas.

• Structural similarity estimation directly compares the genetic material
  of two solution candidates: All possible pairs of ancestor and descendant
  nodes in formula trees are collected and these collections compared for
  pairs of formulas. So we can determine how similar the genetic make-up
  of formulas is without considering their evaluation.

Algorithm 9.2 Evolution strategy inspired pruning of a model m using
the parameters λ, maxUnsuccRounds, h_1, h_2, minimizeModel, cp_max, and
det_max.
  Initialize m_curr as clone of m
  Evaluate m, store calculated fitness in f
  Calculate complexity of m, store result in c
  Initialize unsuccessfulRounds := 0
  Initialize abort := false
  while not(abort) do
    Initialize set of pruned models M
    Initialize structure tree t as tree representation of m_curr
    for i = 1 : λ do
      Set r to random number in [0; 1[
      Initialize m_tmp as clone of m_curr
      if r < 0.5 then
        Remove b, a branch of m_tmp with height(b) ≤ h_1
      else
        Select st, an internal subtree of m_tmp with height(st) ≤ h_2,
        replace st by a randomly chosen descendant of st
      end if
      Evaluate m_tmp, store calculated fitness in f_tmp
      Calculate complexity of m_tmp, store result in c_tmp
      Calculate model complexity decrease mcd = c / c_tmp
      Calculate quality deterioration det = f_tmp / f
      if det ≤ det_max ∧ det / mcd ≤ cp_max then
        Insert m_tmp to M
      end if
    end for
    if M is empty then
      Increase unsuccessfulRounds
      if unsuccessfulRounds = maxUnsuccRounds then
        return m_curr
      end if
    else
      Set unsuccessfulRounds := 0
      if minimizeModel then
        Set m_curr to that model in M with minimum complexity value c_tmp
      else
        Set m_curr to that model in M with minimum cp coefficient det / mcd
      end if
    end if
  end while


9.4.1 Evaluation-Based Similarity Measures 


The main idea of our evaluation-based similarity measures is that the
building blocks of GP formulas are subtrees that are exchanged by crossover
and so form new formulas. So, the evaluation of these branches of all
individuals in a GP population can be used for measuring the similarity of
two models $m_1$ and $m_2$:

For all subtrees in the structure tree of model $m$, collected in $t$, we
collect the evaluation results by applying these subformulas to the given
data collection $data$ as

$$\forall (st_i \in t)\ \forall (j \in [1; N]) : e_{i,j} = eval(st_i, data) \tag{9.41}$$

where $e_{i,j}$ is the value of subformula $st_i$ on the $j$-th sample and
$N$ is the number of samples included in the data collection, no matter
if training or validation data are considered.

The evaluation-based similarity of models $m_1$ and $m_2$, $es(m_1, m_2)$,
is calculated by iterating over all subtrees of $m_1$ (collected in $t_1$)
and, for each branch, picking that subtree of $t_2$ (containing all subtrees
of $m_2$) whose evaluation is most "similar" to the evaluation of that
respective branch. So, for each branch $b_a$ in $t_1$ we compare its
evaluation $e_a$ with the evaluation $e_b$ of all branches $b_b$ in $t_2$,
and the "similarity" can be calculated using the sum of squared errors
($sse$) or the linear correlation coefficient:


• When using the $sse$ function, the sample-wise differences of the
  evaluations of the two given branches are calculated and their sum of
  squared differences is divided by the total sum of squares $tss$ of the
  first branch's evaluation. This results in the similarity measure $s$ for
  the given branches.

  $$\bar{e}_a = \frac{1}{N} \sum_{j=1}^{N} e_a[j] \tag{9.42}$$

  $$sse = \sum_{j=1}^{N} (e_a[j] - e_b[j])^2; \quad tss = \sum_{j=1}^{N} (e_a[j] - \bar{e}_a)^2 \tag{9.43}$$

  $$s_{sse}(b_a, b_b) = 1 - \frac{sse}{tss} \tag{9.44}$$


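• Alternatively the linear correlation coefficient can be used:

  $$\bar{e}_a = \frac{1}{N} \sum_{j=1}^{N} e_a[j]; \quad \bar{e}_b = \frac{1}{N} \sum_{j=1}^{N} e_b[j] \tag{9.45}$$

  $$s_{lc}(b_a, b_b) = \left| \frac{\frac{1}{N} \sum_{j=1}^{N} (e_a[j] - \bar{e}_a)(e_b[j] - \bar{e}_b)}{\sqrt{\frac{1}{N} \sum_{j=1}^{N} (e_a[j] - \bar{e}_a)^2} \cdot \sqrt{\frac{1}{N} \sum_{j=1}^{N} (e_b[j] - \bar{e}_b)^2}} \right| \tag{9.46}$$
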


No matter which approach is chosen, the calculated similarity measure for the
branches $b_a$ and $b_b$, $s(b_a, b_b)$, will always be in the interval
[0; 1]; the higher this value becomes, the smaller is the difference between
the evaluation results.

As we can now quantify the similarity of evaluations of two given subtrees,
we can for each branch $b_a$ in $t_1$ elicit that branch $b_b$ in $t_2$ with
the highest similarity to $b_a$; the similarity values $s$ are collected for
all branches in $t_1$ and their mean value finally gives us a measure for the
evaluation-based similarity of the models $m_1$ and $m_2$, $es(m_1, m_2)$.

Optionally we can force the algorithm to select each branch in $t_2$ not more
than once as best match for a branch in $t_1$ for preventing multiple
contributions of certain parts of the models.

Finally, this similarity function can be parameterized by giving minimum 
and maximum bounds for the height and / or the level of the branches inves- 
tigated. This is important since we can so control which branches are to be 
compared, be it the rather small ones, rather big ones, or all of them. 
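The best-match averaging described above can be sketched in Python as follows. The sketch assumes that the subtrees of both models have already been evaluated, so each branch is represented only by its list of evaluation results. Two details are our own assumptions: the sse-based value is clamped at 0 so that it stays in the interval [0; 1], and branches with constant evaluations are handled by explicit special cases. The function names are hypothetical.

```python
import math


def s_sse(e_a, e_b):
    """Similarity based on the sum of squared errors (Equation 9.44)."""
    mean_a = sum(e_a) / len(e_a)
    sse = sum((a - b) ** 2 for a, b in zip(e_a, e_b))
    tss = sum((a - mean_a) ** 2 for a in e_a)
    if tss == 0.0:                       # constant branch (assumption)
        return 1.0 if sse == 0.0 else 0.0
    return max(0.0, 1.0 - sse / tss)     # clamped to [0, 1] (assumption)


def s_lc(e_a, e_b):
    """Absolute linear correlation of two evaluations (Equation 9.46)."""
    mean_a, mean_b = sum(e_a) / len(e_a), sum(e_b) / len(e_b)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(e_a, e_b))
    var_a = sum((a - mean_a) ** 2 for a in e_a)
    var_b = sum((b - mean_b) ** 2 for b in e_b)
    if var_a == 0.0 or var_b == 0.0:     # undefined correlation (assumption)
        return 0.0
    return abs(cov / math.sqrt(var_a * var_b))


def evaluation_similarity(evals_1, evals_2, similarity=s_sse,
                          prevent_multiple=False):
    """Mean best-match similarity of the branches of m_1 to those of m_2."""
    remaining = list(evals_2)
    total = 0.0
    for e_a in evals_1:
        if not remaining:
            continue
        scores = [similarity(e_a, e_b) for e_b in remaining]
        best = max(range(len(scores)), key=scores.__getitem__)
        total += scores[best]
        if prevent_multiple:
            del remaining[best]          # each branch of m_2 is used once
    return total / len(evals_1)


evals_1 = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
evals_2 = [[1.0, 2.0, 3.0]]
print(evaluation_similarity(evals_1, evals_2, s_sse))
print(evaluation_similarity(evals_1, evals_2, s_lc))
```

The two branches of the first model are perfectly correlated with the single branch of the second model, so the correlation-based similarity is maximal, whereas the sse-based variant also penalizes the differing scale of the second branch.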

Algorithm 9.3 summarizes this evaluation-based similarity measure ap- 
proach. 


Algorithm 9.3 Calculation of the evaluation-based similarity of two models
m_1 and m_2 with respect to data base data
  Collect all subtrees of the tree structure of m_1 in B_1
  Collect all subtrees of the tree structure of m_2 in B_2
  Initialize s := 0
  for each branch b_j in B_1 do evaluate b_j on data, store results in e_1,j
  for each branch b_k in B_2 do evaluate b_k on data, store results in e_2,k
  for each branch b_j in B_1 do
    Initialize s_max := 0, index := -1
    if |B_2| > 0 then
      for each branch b_k in B_2 do
        Calculate similarity s_tmp as similarity of b_j and b_k using e_1,j,
        e_2,k and similarity function s_sse or s_lc
        if s_tmp > s_max do s_max := s_tmp; index := k
      end for
      if PreventMultipleContribution do remove b_index from B_2
    end if
    s := s + s_max
  end for
  return s / |B_1|


9.4.2 Structural Similarity Measures


Structural similarity estimation is, unlike the evaluation-based estimation
described before, independent of data; it is calculated on the basis of the
genetic make-up of the models which are to be compared.

Koza [Koz92b] used the term variety to indicate the number of different
programs in populations by comparing programs structurally and looking for
exact matches. The Levenshtein distance [Lev66] can be used for calculating
the distance between trees, but it is considered rather far from ideal
([Kei96], [O’R97], [LP02]); in [EN00] an edit distance specific to genetic
programming parse trees was presented which considered the cost of
substituting between different node types.

A very comprehensive overview of program tree similarity and diversity 
measures has been given for instance in [BGK04]. The standard tree struc- 
tures representation in GP makes it possible to use more fine-grained struc- 
tural measures that consider nodes, subtrees, and other graph theoretic prop- 
erties (rather than just entire trees). In [Kei96], for example, subtree variety 
is measured as the ratio of unique subtrees over total subtrees and program 
variety as a ratio of the number of unique individuals over the size of the 
population; [MH99] investigated diversity at the genetic level by assigning 
numerical tags to each node in the population. 

When analyzing the structure of models we have to be aware of the fact
that often structurally different models can be equivalent. Let us for example
consider the formulas *(+(2,X2),X3) and +(*(X2,X3),*(X3,2)): As we
know about distributivity we know that these formulas can be considered
equivalent, but any structure analysis approach taking into account size, shape 
or parent / child relationships in the structure tree would assign these models 
a rather low similarity value. This is why we have designed and implemented a 
method that systematically collects all pairs of ancestor and descendant nodes 
and information about the properties of these nodes. Additionally, for each 
pair we also document the distance (with respect to the level in the model 
tree) and the index of the ancestor’s child tree containing the descendant node. 
The similarity of two models is then, in analogy to the method described 
in the previous section, calculated by comparing all pairs of ancestors and 
descendants in one model to all pairs of the other model and averaging the 
similarity of the respective best matches. 

Figure 9.15 shows a simple formula and all pairs of ancestors and descen- 
dants included in the structure tree representing it; the input indices as well 
as the level differences (“level delta”) are also given. Please note: The pairs 
given on the right side of Figure 9.15 are shown intentionally as they sym- 
bolize the pairs of nodes with level difference 0, i.e., nodes combined with 
themselves. 
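Collecting all pairs of ancestor and descendant nodes can be sketched in Python as shown below. The sketch is a simplification under our own assumptions: a formula is a nested tuple (function, input, ...), the type of a node is simply its label, node parameters are ignored, input indices are counted from 0, and pairs of a node with itself carry the index None. The function names are hypothetical.

```python
def label(node):
    return node[0] if isinstance(node, tuple) else node


def children(node):
    return node[1:] if isinstance(node, tuple) else ()


def descendants(node, depth=0):
    """Yield (node, level delta) for `node` and everything below it."""
    yield node, depth
    for child in children(node):
        yield from descendants(child, depth + 1)


def genetic_items(root):
    """Collect (ancestor type, descendant type, level delta, index) tuples."""
    items = []
    for ancestor, _ in descendants(root):
        # Every node is also paired with itself (level delta 0).
        items.append((label(ancestor), label(ancestor), 0, None))
        for index, child in enumerate(children(ancestor)):
            for node, delta in descendants(child, 1):
                items.append((label(ancestor), label(node), delta, index))
    return items


formula = ("+", "x1", ("*", "x2", "x3"))
for item in genetic_items(formula):
    print(item)
```

For this formula with five nodes the sketch yields eleven pairs: five pairs of nodes with themselves, four pairs with the root as ancestor, and two pairs with the multiplication as ancestor.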

FIGURE 9.15: Simple formula structure and all included pairs of ancestors
and descendants (genetic information items).

We define a genetic item as a 6-tuple storing the following information about
the ancestor node $a$ and descendant node $d$:

• $type_a$, the type of the ancestor $a$
• $type_d$, the type of the descendant $d$
• $\delta l$, the level delta
• $index$, the index of the child branch of $a$ that includes $d$
• $np_a$, the node parameters characterizing $a$
• $np_d$, the node parameters characterizing $d$


where the parameters characterizing nodes are represented by tuples contain- 
ing the following information: 


• $var$, the variant (of functions)
• $coeff$, the coefficient (of terminals)
• $to$, the time offset (of terminals)
• $vi$, the variable index (of terminals)


Now we can define the similarity of two genetic items $gi_1$ and $gi_2$,
$s(gi_1, gi_2)$, as follows:

Most important are the types of the definitions referenced by the nodes; if
these are not equal, then the similarity is 0 regardless of all other
parameters:

$$\forall (gi_1, gi_2) : gi_1.type_a \neq gi_2.type_a \Rightarrow s(gi_1, gi_2) = 0 \tag{9.47}$$

$$\forall (gi_1, gi_2) : gi_1.type_d \neq gi_2.type_d \Rightarrow s(gi_1, gi_2) = 0 \tag{9.48}$$

If the types of the nodes correspond correctly, then the similarity of $gi_1$
and $gi_2$ is calculated using the similarity contributors
$s_1 \ldots s_{10}$ of the parameters of $gi_1$ and $gi_2$ weighted with
coefficients $c_1 \ldots c_{10}$.

The differences regarding input index, variant, and variable index are not
in any way scaled or relativized; their similarity contribution is 1 in the
case of equal parameters for both genetic items and 0 otherwise. The
differences regarding level difference, coefficient, and time offset, on the
contrary, are indeed scaled:

• The level difference is divided by the maximum tree height $height_{max}$,

• the difference of coefficients is divided by the range of the referenced
  terminal definition (in case of uniformly distributed coefficients) or
  divided by the standard deviation $\sigma$ (in case coefficients are
  normally distributed), and

• the difference of the time offsets is divided by the maximum time offset
  allowed, $offset_{max}$.


$$\forall (gi_1, gi_2 : gi_1.type_a = gi_2.type_a \wedge gi_1.type_d = gi_2.type_d) : \tag{9.49}$$

$$s_1 = 1 - \frac{|gi_1.\delta l - gi_2.\delta l|}{height_{max}} \tag{9.50}$$

$$s_2 = \begin{cases} 0 & : gi_1.index \neq gi_2.index \\ 1 & : gi_1.index = gi_2.index \end{cases} \tag{9.51}$$

$$s_3 = \begin{cases} 0 & : gi_1.np_a.var \neq gi_2.np_a.var \\ 1 & : gi_1.np_a.var = gi_2.np_a.var \end{cases} \tag{9.52}$$

$$s_4 = \begin{cases} 0 & : gi_1.np_d.var \neq gi_2.np_d.var \\ 1 & : gi_1.np_d.var = gi_2.np_d.var \end{cases} \tag{9.53}$$

$$\delta c_a = |gi_1.np_a.coeff - gi_2.np_a.coeff| \tag{9.54}$$

$$s_5 = 1 - \begin{cases} \dfrac{\delta c_a}{gi_1.type_a.max - gi_1.type_a.min} & : isUniformTerminal(gi_1.type_a) \\ \dfrac{\delta c_a}{gi_1.type_a.\sigma \cdot 4} & : isGaussianTerminal(gi_1.type_a) \end{cases} \tag{9.55}$$

$$\delta c_d = |gi_1.np_d.coeff - gi_2.np_d.coeff| \tag{9.56}$$

$$s_6 = 1 - \begin{cases} \dfrac{\delta c_d}{gi_1.type_d.max - gi_1.type_d.min} & : isUniformTerminal(gi_1.type_d) \\ \dfrac{\delta c_d}{gi_1.type_d.\sigma \cdot 4} & : isGaussianTerminal(gi_1.type_d) \end{cases} \tag{9.57}$$

$$s_7 = 1 - \frac{|gi_1.np_a.to - gi_2.np_a.to|}{offset_{max}} \tag{9.58}$$

$$s_8 = 1 - \frac{|gi_1.np_d.to - gi_2.np_d.to|}{offset_{max}} \tag{9.59}$$

$$s_9 = \begin{cases} 0 & : gi_1.np_a.vi \neq gi_2.np_a.vi \\ 1 & : gi_1.np_a.vi = gi_2.np_a.vi \end{cases} \tag{9.60}$$

$$s_{10} = \begin{cases} 0 & : gi_1.np_d.vi \neq gi_2.np_d.vi \\ 1 & : gi_1.np_d.vi = gi_2.np_d.vi \end{cases} \tag{9.61}$$

Finally, there are two possibilities how to calculate the structural
similarity of $gi_1$ and $gi_2$, $sim(gi_1, gi_2)$: On the one hand this can
be done in an additive way, on the other hand in a multiplicative way.


• When using the additive calculation, which is obviously the simpler way,
  $sim(gi_1, gi_2)$ is calculated as the sum of the similarity contributions
  $s_{1 \ldots 10}$ weighted using the factors $c_{1 \ldots 10}$ and, for the
  sake of normalization of results, divided by the sum of the weighting
  factors:

  $$sim(gi_1, gi_2) = \frac{\sum_{i=1}^{10} s_i \cdot c_i}{\sum_{i=1}^{10} c_i} \tag{9.62}$$


• Otherwise, when using the multiplicative calculation method, we first
  calculate a punishment factor $p_i$ for each $s_i$ (again using weighting
  factors $c_i$, $0 \leq c_i \leq 1$ for all $i \in [1; 10]$) as

  $$\forall (i \in [1; 10]) : p_i = (1 - s_i) \cdot c_i \tag{9.63}$$

  and then get the temporary similarity result as

  $$sim_{tmp}(gi_1, gi_2) = \prod_{i=1}^{10} (1 - p_i). \tag{9.64}$$

  In the worst case scenario we get $s_i = 0$ for all $i \in [1; 10]$ and
  therefore the worst possible $sim_{tmp}$ is

  $$sim_{worst} = \prod_{i=1}^{10} (1 - ((1 - s_i) \cdot c_i)) = \prod_{i=1}^{10} (1 - c_i) \tag{9.65}$$

  As $sim_{worst}$ is in general greater than 0, we linearly scale the
  results to the interval [0; 1]:

  $$sim(gi_1, gi_2) = \frac{sim_{tmp}(gi_1, gi_2) - sim_{worst}}{1 - sim_{worst}} \tag{9.66}$$

In fact, we prefer this multiplicative similarity calculation method since 
it allows more specific analysis: By setting a weighting coefficient c; to 
a rather high value (i.e., near or even equal to 1.0) the total similarity 
will become very small for pairs of genetic items that do not correspond 
with respect to this specific aspect, even if all other aspects would lead 
to a high similarity result. 
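The difference between the two calculation methods can be illustrated with a short Python sketch of Equations 9.62 to 9.66. The function names are hypothetical, and we assume that at least one weighting factor is greater than 0 so that neither denominator vanishes.

```python
import math


def additive_similarity(s, c):
    """Weighted mean of the similarity contributions (Equation 9.62)."""
    return sum(s_i * c_i for s_i, c_i in zip(s, c)) / sum(c)


def multiplicative_similarity(s, c):
    """Scaled product of the punishment terms (Equations 9.63 to 9.66)."""
    sim_tmp = math.prod(1.0 - (1.0 - s_i) * c_i for s_i, c_i in zip(s, c))
    sim_worst = math.prod(1.0 - c_i for c_i in c)
    return (sim_tmp - sim_worst) / (1.0 - sim_worst)


# Two genetic items that agree in everything except the input index (s_2),
# whose weighting factor is set to 1.0 to make this aspect decisive.
s = [1.0, 0.0] + [1.0] * 8
c = [0.2, 1.0] + [0.2] * 8
print(additive_similarity(s, c))
print(multiplicative_similarity(s, c))
```

In the additive case the mismatch only lowers the result to roughly 0.64, whereas the multiplicative calculation yields 0, which is the more specific behavior described above.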


Based on this similarity measure it is easy to formulate a similarity
function that measures the similarity of two model structures. In analogy to
the approach presented in the previous section, for comparing models $m_1$
and $m_2$ we collect all pairs of ancestors and descendants (up to a given
maximum level difference) in $m_1$ and $m_2$ and look for the best matches in
the respective opposite model's pool of genetic items, i.e., pairs of
ancestor and descendant nodes. As we are able to quantify the similarity of
genetic items, we can elicit for each genetic item $gi_j$ in the structure
tree of $m_1$ exactly that genetic item $gi_k$ in the model structure $m_2$
with the highest similarity to $gi_j$; the similarity values $s$ are
collected for all genetic items contained in $m_1$ and their mean value
finally gives us a measure for the structure-based similarity of the models
$m_1$ and $m_2$, $sim(m_1, m_2)$.

Optionally we can force the algorithm to select each genetic item of $m_2$
not more than once as best match for an item in $m_1$ for preventing multiple
contributions of certain components of the models.


This function is defined in a more formal way using pseudo-code in Algo- 
rithm 9.4. 


Algorithm 9.4 Calculation of the structural similarity of two models m_1 and
m_2
  Collect all genetic items of m_1 in GI_1
  Collect all genetic items of m_2 in GI_2
  Initialize s := 0
  for each genetic item gi_j in GI_1 do
    Initialize s_max := 0, index := -1
    if |GI_2| > 0 then
      for each genetic item gi_k in GI_2 do
        Calculate similarity s_tmp as similarity of gi_j and gi_k
        if s_tmp > s_max do s_max := s_tmp; index := k
      end for
      if PreventMultipleContribution do remove gi_index from GI_2
    end if
    s := s + s_max
  end for
  return s / |GI_1|


Obviously, it is possible that some model contains all pairs of genetic items
that are also incorporated in another model, but not vice versa. Thus, this
similarity measure $sim(m_1, m_2)$ is not symmetric, i.e., $sim(m_1, m_2)$
does not necessarily return the same result as $sim(m_2, m_1)$ for any pair
of models $m_1$ and $m_2$.
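This asymmetry is easy to reproduce with a small Python sketch of the best-match averaging. The sketch is deliberately simplified: genetic items are plain tuples, and instead of the weighted item similarity defined above we use an exact-match comparison (our assumption, chosen only to keep the example short). The function names are hypothetical.

```python
def exact_match(gi_1, gi_2):
    """Simplified item similarity: 1 for identical items, 0 otherwise."""
    return 1.0 if gi_1 == gi_2 else 0.0


def model_similarity(items_1, items_2, item_similarity=exact_match,
                     prevent_multiple=False):
    """Mean best-match similarity of the genetic items of m_1 to m_2."""
    remaining = list(items_2)
    total = 0.0
    for gi in items_1:
        if not remaining:
            continue
        scores = [item_similarity(gi, gk) for gk in remaining]
        best = max(range(len(scores)), key=scores.__getitem__)
        total += scores[best]
        if prevent_multiple:
            del remaining[best]          # each item of m_2 is used once
    return total / len(items_1)


# All items of the small model also occur in the big one, but not vice versa.
small = [("+", "+", 0, None), ("+", "x", 1, 0)]
big = small + [("+", "y", 1, 1)]
print(model_similarity(small, big))
print(model_similarity(big, small))
```

Every item of the smaller model finds a perfect match in the bigger one, while one item of the bigger model finds none, so the two directions yield different values.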

Of course, this similarity concept for GP individuals cannot be the basis of
theoretical concepts comparable to those based on GP (hyper)schemata, for
example; we do not want to make any statements here about the probability
of certain parts of formulas to occur in a given generation. In the presence
of mutation or other structure modifying operations (as for example pruning)
we are interested in measuring the structural diversity in GP populations;
using this structural similarity measure we are able to do so.


Chapter 10 


Applications of Genetic Algorithms: 
Combinatorial Optimization 


Within Chapter 7 the knowledge about the global optimum has been used in 
order to analyze and highlight certain properties of the considered algorithms. 
In case of practical applications of considerable dimension this information is 
not available. 


The analyses described in this chapter do not consider information about 
the genotypes of global optima and are therefore limited to the observation 
of the dynamics of genetic diversity in populations and in subpopulations of 
parallel GA concepts. The main conclusions of Chapter 7 were that it is most 
beneficial for the evolutionary process of genetic algorithms if the essential 
genetic information (the alleles of the globally optimal solution) establishes 
slowly in the population, which is important for gaining high quality results. 
As already indicated in previous chapters, this can be achieved by offspring 
selection. In this chapter results for several benchmark problem instances will 
be reported on in terms of achievable solution qualities, i.e., best and average 
solution qualities. The results for the TSP benchmark problem instances ob- 
tained using standard GAs, GAs with offspring selection, and the SASEGASA 
have been taken from [Aff05] and [AW04b]. Additionally, some characteristic 
aspects of certain algorithm variants are analyzed in greater detail by observ- 
ing the genetic diversity over time similar to the genetic diversity analyses 
reported on in Section 6.2. For the CVRP we have also compared the perfor- 
mance of standard GAs to the performance of GAs with offspring selection. 
By doing so, the observation of genetic diversity over time has again been 
used to point out selected aspects that are representative for the respective 
algorithms when applied to the CVRP. 


Beside the increased robustness of offspring selection described in Chap- 
ter 7 we here also consider the effects of a greater number of subpopulations 
for the SASEGASA. The most important fact is that we can in this con- 
text observe the scalability of achievable global solution qualities by applying 
greater numbers of subpopulations. 


As is shown in this chapter, we can observe in this context that a slow
decrease of genetic diversity caused by the evolutionary forces supports the
GA in producing high quality results.


10.1 The Traveling Salesman Problem


All TSP benchmark problems used here have been taken from the TSPLIB
[Rei91] using updated information¹ about the best or at least the best known
solutions. The results for the TSP are represented as the relative difference
to the best known solution defined as

$$relativeDifference = \left( \frac{ResultQuality}{BestKnownQuality} - 1 \right) \cdot 100\ [\%] \tag{10.1}$$

All values presented in the following tables are the best and average relative 
differences of five independent test runs executed for each test case. The 
average number of evaluated solutions gives a quite objective measure of the 
computational effort. 
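As a small worked example of Equation 10.1, the following Python sketch computes the relative difference for a tour length. The function and parameter names are our own; the value 7542 is the optimal tour length of the berlin52 instance as listed in the TSPLIB, and the result length used here is a made-up illustration, not a measured result.

```python
def relative_difference(result_quality, best_known_quality):
    """Relative difference to the best known solution in percent."""
    return (result_quality / best_known_quality - 1.0) * 100.0


# A hypothetical tour of length 7825 for berlin52 (best known: 7542).
print(round(relative_difference(7825, 7542), 2))
```

A value of 0 therefore means that the best known solution has been reached, and larger values indicate longer tours.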


10.1.1 Performance Increase of Results of Different 
Crossover Operators by Means of Offspring Selection 


The first aspect to be considered is the effect of the offspring selection model 
on the quality improvement using different crossover operators. In order to 
visualize the positive effects of the new methods in a more obvious way, we 
also present results that were generated by a standard GA with proportional 
selection, generational replacement, and 1-elitism. 

In Table 10.2 the results achieved with the conventional GA are listed. The 
fixed parameter values that were used for all algorithms in the different test 
runs are given in Table 10.1. 

As we want to see how the algorithmic concepts presented in the first part 
of this book influence the ability of GAs to produce high quality results, 
the effects of offspring selection are here given on the basis of a number of 
experiments which were performed on a single population in order to not dilute 
the effects of offspring selection principles with the effects of the segregation 
and reunification strategies. Table 10.3 recapitulates the results for a
selection of commonly applied crossover operators suggested for the path
representation of the TSP ([Mic92], [LKM+99]), each on its own, as well as
one combination of more effective crossover operators.

Remarkably, even crossover operators that are commonly considered rather
unsuitable for the TSP [LKM+99] lead to quite good results in combination
with offspring selection. The reason for this behavior is that in our
selection principle only children that have emerged as a good combination of
their parents' attributes are considered for the further evolutionary
process, if the success ratio is set to a higher range.


¹ Updates for the best (known) solutions can for example be found on
http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95/.

Table 10.1: Overview of algorithm parameters.

Parameters for the standard GA (results presented in Tab. 10.2)
  Generations                    100,000
  Population Size                120
  Elitism Rate                   1
  Mutation Rate                  0.05
  Selection Operator             Roulette
  Mutation Operator              Simple Inversion

Parameters for the offspring selection GA (results presented in Tab. 10.3)
  Population Size                500
  Elitism Rate                   1
  Mutation Rate                  0.05
  Selection Operator             Roulette
  Mutation Operator              Simple Inversion
  Success Ratio                  0.7
  Maximum Selection Pressure     250


Table 10.2: Experimental results achieved using a standard GA.

Problem    Crossover      Best   Average   Evaluated Solutions
berlin52   OX             0.00      3.76            12,000,000
berlin52   ERX            5.32      7.73            12,000,000
berlin52   MPX           21.74     26.52            12,000,000
ch130      OX             3.90      5.41            12,000,000
ch130      ERX          142.57    142.62            12,000,000
ch130      MPX           83.57     85.07            12,000,000
kroa200    OX             3.14      4.69            12,000,000
kroa200    ERX          325.92    336.19            12,000,000
kroa200    MPX          146.94    148.08            12,000,000

Table 10.3: Experimental results achieved using a GA with offspring selection.

Problem    Crossover      Best   Average   Evaluated Solutions   Change to standard GA
berlin52   OX             …      …         …                     …
berlin52   ERX            …      …         6,784,626             …
berlin52   MPX            …      …         6,825,199             …
berlin52   OX, ERX, MPX   …      …         7,457,451             …
ch130      OX             …      …         13,022,20…            …
ch130      ERX            …      …         4,674,485             …
ch130      MPX            …      …         9,282,291             …
ch130      OX, ERX, MPX   …      …         5,758,022             …
kroA200    OX             …      …         15,653,950            …
kroA200    ERX            …      …         19,410,458            …
kroA200    MPX            …      …         13,626,348            …
kroA200    OX, ERX, MPX   …      …         9,404,241             …

In combination with higher values for the maximum selection pressure,
genetic search can be guided advantageously even with poor crossover
operators, as the larger number of handicapped offspring is simply not
considered for the further evolutionary process. Figure 10.1 shows this effect
in detail for the berlin52 TSP instance. Quite good results in terms of global
convergence could also be achieved using a combination of different crossover
operators: additional genetic diversity is brought into the population in this
way, while inferior crossover results are discarded by the enhanced offspring
selection model.


[Figure: y-axis "Solution Quality"; x-axis "Crossover Operators" (OX, ERX); legend: New Selection Scheme, Standard GA]

FIGURE 10.1: Quality improvement using offspring selection and various
crossover operators (taken from [AW04b]). This figure is displayed with kind
permission of Springer Science and Business Media.


10.1.2 Scalability of Global Solution Quality by SASEGASA 


In this part of the experimental section we present the main effect of
SASEGASA when applied in a practical implementation in a distributed
environment: a higher number of subpopulations at the beginning of the
evolutionary process yields scalable improvements in terms of global
convergence.

Table 10.4: Parameter values used in the test runs of the SASEGASA algorithms
with single crossover operators as well as with a combination of the operators.

Parameters for SASEGASA with 1 crossover operator (OX)
(Results presented in Tab. 10.5)

Subpopulation Size           100
Elitism Rate                 1
Mutation Rate                0.05 resp. 0.00
Selection Operator           Roulette
Crossover Operators          OX
Mutation Operator            Simple Inversion
Success Ratio                0.8
Maximum Selection Pressure   30

Parameters for SASEGASA with 1 crossover operator (ERX)
(Results presented in Tab. 10.6)

Subpopulation Size           100
Elitism Rate                 1
Mutation Rate                0.05 resp. 0.00
Selection Operator           Roulette
Crossover Operators          ERX
Mutation Operator            Simple Inversion
Success Ratio                0.8
Maximum Selection Pressure   30

Parameters for SASEGASA with 1 crossover operator (MPX)
(Results presented in Tab. 10.7)

Subpopulation Size           100
Elitism Rate                 1
Mutation Rate                0.05 resp. 0.00
Selection Operator           Roulette
Crossover Operators          MPX
Mutation Operator            Simple Inversion
Success Ratio                0.8
Maximum Selection Pressure   15

Parameters for SASEGASA with a combination of crossover operators (OX, ERX, MPX)
(Results presented in Tab. 10.8)

Subpopulation Size           100
Elitism Rate                 1
Mutation Rate                0.05 resp. 0.00
Selection Operator           Roulette
Crossover Operators          OX, ERX, MPX
Mutation Operator            Simple Inversion
Success Ratio                0.8
Maximum Selection Pressure   15


Table 10.5: Results showing the scaling properties of SASEGASA with one
crossover operator (OX), with and without mutation.


Evaluated 

Problem Solutions 
berlin52 Le i 22,577 
berlin52 5 e: 5 731,191 242,195 
berlin52 26 x 1,007,320 751,379 
berlin52 5 3. 2,802,620 2,368,694 
berlin52 a af 8,407,988 7,117,442 
berlin52 š 5 25,154,907 25,045,133 
berlin52 87,850,762 
ch130 59. 9. 28,809 
ch130 5 5 .8¢ 834,049 240,916 
ch130 46 $ 2,210,398 914,765 
ch130 As .25 5,410,587 2,743,967 
ch130 s$ 1s 13,912,314 9,104,041 
ch130 30,082,798 
ch130 f 102,551,323 
kroa200 90. 36. 139,629 34,315 
kroa200 5 26 53. 1,299,129 253,757 
kroa200 : 5 3,155,000 1,066,148 
kroa200 9. 7,689,795 189,587 
kroa200 < a 21,251,916 688,113 
kroa200 32,909,364 
kroa200 116,522,803 




Table 10.6: Results showing the scaling properties of SASEGASA with one 
crossover operator (ERX), with and without mutation. 


Results without mutation 


5 Evaluated Evaluated 
Problem Average Solutions B Solutions 


| 
berlin52 5 7 34,578 30,031 


Results with mutation 


berlin52 8 7 310,088 239,678 
berlin52 i 4 809,083 692,015, 
berlin52 3 ` 229,713 1,962,213 
berlin52 5 5 753,499 6,358,343 
berlin52 Š a 3,020,154 22,299,205 


berlin52 í : 402,610 ,851,322 


ch130 EE 3 T41,314 127,335 
ch130 : I 911,371 702,917 
ch130 x: f: “1,820,004 1,572,299 
ch130 d z ,831,614 3,779,535 
ch130 x -95 3,271,120 10,354,983 
ch130 2i z 36,602,158 32,090,886 

218,379 104,042,226 


kro : 37. 453,054 
,458,083 

“5,462,657 

12,076,655 

28,810,360 

73,702,312 


kroA200 ne : 171,391,466 796,110 


Table 10.7: Results showing the scaling properties of SASEGASA with one 
crossover operator (MPX), with and without mutation. 


Evaluated 

Problem Solutions 
berlin52 1 9.15 A: ,635 n % 58,985 
berlin52 5 ; 3: 497,211 .8¢ -05 418,175 
berlin52 z . »216,238 < 1,153,493 
berlin52 k z ,302,870 $ 2,445,796 
berlin52 $ : 875,130 5 í 9,227,596 
berlin52 5 . ,414,626 . ` 19,769,438 
berlin52 + 5 92,662,669 . . 56,682,137 
ch130 5 . 59,847 3 -19 63,547 
ch130 5 2. p 504,065 36.73 9. 585,371 
ch130 “ 32. ,867,440 1,602,154 
ch130 x à :665,532 3 z 3,875,043 
ch130 .83 š ,096,130 3 7 8,837,255 
ch130 : 3. 379,806 $ $ 26,085,696 
ch130 : E 905,160 : 74,759,771 
kroA200 98. 3. 94,830 : x 89,746 
kroA200 5 5 -53 ,461,829 . 35. 942,102 
kroA200 3. 30.38 ,096,990 30.96 „4E 2,932,813 
kroA200 Br . 9,154,913 .5¢ 37. 7,907,319 
kroA200 i 65 573,066 : : 18,798,917 
kroA200 5 43 ,363,179 $ 2 57,105,958 
kroA200 B: a 759,298 «38 -3 115,013,599 


Indeed, as Tables 10.5-10.8 show for the individual crossover operators
and for their combination, the achievable solution quality can be pushed to
the highest quality regions even for higher-dimensional problems, with only
linearly increasing computational effort, simply by increasing the initial
number of subpopulations. In particular, when using a combination of the
considered crossover operators (see Table 10.8), which becomes possible due
to self-adaptive offspring selection, the global optimum could be detected for
all considered benchmark problems. The SASEGASA parameter settings used for
the results given before were chosen in order to point out certain aspects,
whereas the parameters used for achieving the results given in Table 10.8 are
quite advantageous for applying SASEGASA to the TSP. Therefore, higher-dimensional
test cases are also shown in Table 10.8, for which the global optimum

Table 10.8: Results showing the scaling properties of SASEGASA with a
combination of crossover operators (OX, ERX, MPX), with and without
mutation.


                       | Results with mutation | Results without mutation

           Sub-          Evaluated               Evaluated

Problem    populations   Solutions               Solutions
berlin52 1 0.72 6.12 7 s . 49,866 
berlin52 5 0.46 7 z R 548,284 
berlin52 10 0. a 1,296,130 
berlin52 20 oO. 4,247,216 $ . 3,275,660 
berlin52 40 0. 11,240,451 5 . 7,394,200 
berlin52 80 0. 53,844,262 # is 21,129,365 
berlin52 160 0. 246,725,814 4 K 61,538,007 
ch130 1 34.5 131,650 35. x 117,720 
ch130 5 T: 1,243,637 x i 961,172 
ch130 10 4. 3,275,072 x 5 2,578,121 
ch130 20 1. 9,092,937 . . 6,475,903 
ch130 40 0; 32,446,649 3 18 15,027,715 
ch130 80 0. 77,406,460 ‘. > 41,921,823 
ch130 160 0. 96,545,540 
kroA200 1 0.¢ 208,108 
kroA200 5 0.8 1,555,680 
kroA200 10 6. 5,165,175 +22 ý 3,778,171 
kroA200 20 2. 18,477,477 s v 9,321,037 
kroA200 40 Le 68,132,626 A -63 29,112,958 
kroA200 80 Le 134,467,940 . BE 68,299,249 
kroA200 160 0. 201,322,654 ` .25 131,669,520 
lin318 1 X % 403,431 Z 4 242,534
lin318 5 3,292,861 34. . 2,338,523
lin318 10 8,093,264 R 5 5,680,243
lin318 20 26,534,811 x wd 13,394,560
lin318 40 200,885,952 s wd 33,267,177
lin318 80 624,986,088 . . 93,879,278
lin318 160 959,258,717 a .5E 256,372,204
lin318 320 2,116,724,528 x ʻ 632,882,394
fl417 1 585,102 x 7 408,664
fl417 5 5,104,971 32.4% . 3,615,174
fl417 10 26,586,653 g P 8,451,114
fl417 20 106,892,925 ʻ 3 22,441,329
fl417 40 664,674,431 . . 236,658,335
fl417 80 1,310,193,035 . iTS 519,000,908
fl417 160 2,122,856,932 i .55 802,368,224
fl417 320 4,367,217,575 a .d 2,231,720,072


could also be found by simply increasing the number of demes. Apart from
the computational effort, which becomes higher and higher in a
single-processor environment, the degree of difficulty may be increased by
increasing the problem dimension.


The scalability of achievable solution quality, which comes along with a
linearly increasing number of generated solutions, is a real advancement over
classical serial and parallel GA concepts, where a greater number of evaluated
solutions cannot improve global solution quality anymore once the GA has
prematurely converged. As considered theoretically in the previous chapters,
the reason for this beneficial behavior is the interplay between genetic drift
and migration embedded in the self-adaptive selection pressure steering
concept. Even if the results achieved without mutation are not quite as good as
those achieved by SASEGASA with a standard mutation rate, it is remarkable
that the scaling property still holds. We have also executed experiments with
smaller numbers of larger subpopulations as well as with greater numbers of
smaller subpopulations. These results are not documented here, as these test
series showed basically the same results with a comparable total effort of
evaluated solutions. This is an interesting aspect for an efficient practical
adaptation to a concrete parallel environment.

Even if the achieved results are clearly superior to most of the results
reported for applications of evolutionary algorithms to the TSP [LKM+99], it
has to be pointed out again that all introduced and applied additions to a
standard evolutionary algorithm are generic and that absolutely no
problem-specific local pre- or post-optimization techniques have been used in
our experiments.


10.1.3 Comparison of the SASEGASA to the Island-Model 
Coarse-Grained Parallel GA 


The island model is the most frequently applied parallel GA model. Moreover,
the island model is more closely related to the newly introduced concepts of
the present work than other coarse- and fine-grained parallel GA models.
Therefore, this part of the empirical studies discusses the results that are
achievable with a conventional island GA compared to the results of SASEGASA.

A main difference between an island GA and SASEGASA is the self-adaptive
selection pressure steering concept, which as a side effect allows the
detection of premature convergence in the algorithm's subpopulations. It
therefore becomes possible to select the dates of migration phases
dynamically, and SASEGASA no longer depends on static migration intervals as
the island model does. Furthermore, especially in the migration phases, the
self-adaptive selection pressure steering concept of SASEGASA enables the
algorithm to join the genetic information of individuals descending from
different subpopulations in a more directed way than within the island model.
Less fit offspring, which may emerge especially in the migration phases as
children of parents descending from very different regions of the solution
space, are simply not considered for the ongoing evolutionary process due to
offspring selection. In addition, it is not useful to apply a combination of
crossover operators within the standard island model, as each crossover
result would become part of the ongoing evolutionary process since no
offspring selection steps are performed. In contrast, SASEGASA maintains only
those crossover results that represent a successful combination of their
parents' attributes, which makes a combination of several operators
reasonable.

Tables 10.10-10.12 show the results for the island GA using the same TSP
benchmarks that we have also used for testing SASEGASA, applying either OX
(see Table 10.10), ERX (see Table 10.11), or MPX (see Table 10.12) as
crossover mechanism, each with and without mutation.

Table 10.9: Parameter values used in the test runs of an island model GA with
various operators and various numbers of demes.

Parameters for the Island GA
(Results presented in Tab. 10.10-Tab. 10.12)

Deme Population Size   100
Elitism Rate           1
Mutation Rate          0.05 resp. 0.00
Selection Operator     Roulette
Crossover Operators    OX, ERX, resp. MPX
Mutation Operator      Simple Inversion
Migration Interval     20 Rounds
Migration Rate         0.15 (of deme)
Migration Topology     unidirectional ring
Migration Selection    Best
Migration Insertion    Random
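The migration settings of Table 10.9 (unidirectional ring, best individuals selected, random insertion, migration rate 0.15) could be sketched as follows. This is a hedged illustration only: the function name `migrate_ring`, the synchronous exchange, and the representation of individuals as Python lists are assumptions, not details taken from the experiments.

```python
import random

def migrate_ring(demes, fitness, rate=0.15, rng=random):
    """One migration phase on a unidirectional ring (minimization).

    Each deme sends copies of its best individuals to its successor,
    where they overwrite randomly chosen individuals.
    """
    k = len(demes)
    # Pick all emigrants first so that every deme sends its own
    # pre-migration best individuals (assumed synchronous exchange).
    emigrants = []
    for deme in demes:
        count = max(1, round(rate * len(deme)))
        emigrants.append(sorted(deme, key=fitness)[:count])
    for i, group in enumerate(emigrants):
        target = demes[(i + 1) % k]
        # Random insertion: overwrite distinct random positions.
        slots = rng.sample(range(len(target)), len(group))
        for slot, individual in zip(slots, group):
            target[slot] = list(individual)  # copy, assuming list individuals
    return demes
```

In an island GA such a function would be called at a fixed interval (every 20 rounds in Table 10.9), whereas SASEGASA chooses the dates of its migration phases dynamically.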


Table 10.10: Results showing the scaling properties of an island GA with 
one crossover operator (OX) using roulette-wheel selection, with and without 
mutation. 


Evaluated 


Problem Solutions 
berlin52 2.29 4.06 1,500,000 3. 1,500,000 
berlin52 A x 7,500,000 E 7,500,000 
berlin52 Z y 15,000,000 3 15,000,000 
berlin52 x K 30,000,000 $ 30,000,000 
berlin52 X $ 60,000,000 k 60,000,000 
berlin52 2 $ 120,000,000 .55 5 120,000,000 
berlin52 $ $ 240,000,000 a x 240,000,000 
ch130 a 1,500,000 3.5 1,500,000 
ch130 a ʻ 7,500,000 . 7,500,000 
ch130 f: a 15,000,000 9. 15,000,000 
ch130 > 3.45 30,000,000 91.93 30,000,000 
ch130 `, Af 60,000,000 x 60,000,000 
ch130 i Ş 120,000,000 91.88 120,000,000 
ch130 .55 ES 240,000,000 0% 240,000,000 
kroa200 š . 1,500,000 ES 1,500,000 
kroa200 5 s & 7,500,000 5 7,500,000 
kroa200 7 i 15,000,000 s 15,000,000 
kroa200 3 a 30,000,000 .6E 30,000,000 
kroa200 a af 60,000,000 ute 60,000,000 
kroa200 5 A 120,000,000 9s 120,000,000 
kroa200 a x 240,000,000 x 240,000,000 


As already noticed for the conventional GA (see Table 10.2), the results of
the island GA are also quite good when using the OX crossover operator (see
Table 10.10) and, in terms of solution quality, comparable to the SASEGASA
results obtained using the OX crossover. Still, the computational effort
(i.e., the number of evaluated solutions) required to achieve these results is
comparatively high, as migration is applied in a less goal-oriented way in the
island GA. As only mutation and migration are able to regain alleles that are
lost due to genetic drift, the island GA results provide further empirical
evidence that migration works less effectively in the island model: they are,
in fact, really bad when mutation is deactivated. SASEGASA, in contrast, is
still able to scale up solution quality to high quality regions even without
mutation (as can be seen in Tables 10.5-10.8).

The results returned by the island GA using the ERX and MPX crossovers are
rather weak regardless of mutation, and are significantly outperformed by the
SASEGASA results. As we have already seen for the conventional GA, the

Table 10.11: Results showing the scaling properties of an island GA with one
crossover operator (ERX) using roulette-wheel selection, with and without
mutation.


Evaluated 

Problem Solutions 
berlin52 1,500,000 
berlin52 5 f A K 7,500,000 
berlin52 a x 5,000,000 3.3 3 15,000,000 
berlin52 F z 30,000,000 a i 30,000,000 
berlin52 f: s£ 000,000 5 ; 60,000,000 
berlin52 he 2 ,;000,000 $ : 120,000,000 
berlin52 š j 000,000 . x 240,000,000 
ch130 .45 3 ,900,000 3 A 1,500,000 
ch130 5 9. k ,500,000 5S í 7,500,000 
ch130 ; 91. 5,000,000 3. .08 15,000,000 
ch130 $ K: 30,000,000 .93 k 30,000,000 
ch130 ; .63 ,000,000 r ka: 60,000,000 
ch130 7 wd ,;000,000 of wd 120,000,000 
ch130 a 39. ,000,000 38. 5 240,000,000 
kroa200 3 FE 500,000 5 OE 1,500,000 
kroa200 5 28 Ps ,500,000 a 5 3 i 7,500,000 
kroa200 5 3 5,000,000 3 309. 15,000,000 
kroa200 : 32.56 : 30,000,000 .53 s 30,000,000 
kroa200 9. : ,000,000 37. i 60,000,000 
kroa200 z 9. ,;000,000 x 33. 120,000,000 
kroa200 n: F 000,000 2 x 240,000,000 


Table 10.12: Results showing the scaling properties of an island GA with one 
crossover operator (MPX) using roulette-wheel selection, with and without 
mutation. 


Evaluated 
Problem Solutions 
berlin52 33.64 36. 1,500,000 5 wd 1,500,000 
berlin52 33 y :500,000 -83 53 7,500,000 
berlin52 f . 5,000,000 -83 n 15,000,000 
berlin52 2 3. f 30,000,000 9. 2 30,000,000 
berlin52 K af ,;000,000 3. -63 60,000,000 
berlin52 y : ,;000,000 a : 120,000,000 
berlin52 D x 000,000 ps 3. 240,000,000 
chis 9. : 3 Fi 5 Z 7D00, 
ch130 5 oa 4 ,500,000 CTA a 7,500,000 
ch130 me ; ,000,000 3 7 15,000,000 
ch130 -3 3 30,000,000 AE i 30,000,000 
ch130 36. 2 ,000,000 30.95 39. 60,000,000 
ch130 31. 38. ,;000,000 E 34. 120,000,000 
ch130 34.6 .03 000,000 9.3 32. 240,000,000 
kroa200 A: 2 ,000,000 -85 a8 1,500,000 
kroa200 E a ,500,000 & 3 7,500,000 
kroa200 5 $ 5,000,000 -75 : 15,000,000 
kroa200 91. 98.3 30,000,000 : : 30,000,000 
kroa200 3.3 91.3 ,000,000 .85 23 60,000,000 
kroa200 ig 90. 000,000 £ à 120,000,000 
kroa200 2a y 000,000 4; .2¢ 240,000,000 


island model also does not offer concepts for dropping out disadvantageous
crossover results.

Thus, in contrast to most of the enhanced GA concepts discussed in the
literature, which are in most cases tuned for some specific purpose, the
SASEGASA algorithm appears to behave very stably under various conditions. It
is also quite impressive that the generic applicability of SASEGASA and its
positive attributes appear unimpaired when considering a completely different
optimization problem, namely the optimization of hard real-valued benchmark
test functions in high dimensions [Aff05]. It has been reported in [Aff05]
that the SASEGASA algorithm, without any problem-specific adaptations, is able
to find the globally optimal solution for all considered benchmark test
functions (Rosenbrock, Rastrigin, Griewangk, Ackley, and Schwefel's sine root
function) in dimensions that have hardly been discussed in GA literature (up to
n = 2000).


10.1.4 Genetic Diversity Analysis for the Different GA 
Types 


As already mentioned in the introductory part of this chapter, we do not
confine ourselves to reporting the results in table form, but also try to
compare the internal functioning of the individual algorithmic variants here.
In contrast to Chapter 7, we here deliberately do without information about
globally optimal solutions, which are also unknown in practical applications.

Results that are as interesting as those achieved by observing the dynamics
of the globally optimal alleles can be obtained by analyzing the distribution
of genetic diversity during the run of a GA. For this purpose it is necessary
to define an appropriate distance measure between two solution candidates for
the problem representation at hand. In contrast to diversity analyses for
GP-based structure identification, such a distance measure is quite intuitive
and easy to describe for the TSP.

The similarity measure between two TSP solutions $t_1$ and $t_2$ used here is
defined as a similarity value sim between 0 and 1:

$$\mathrm{sim}(t_1, t_2) = \frac{|\{e : e \in E(t_1) \wedge e \in E(t_2)\}|}{|E(t_1)|} \in [0, 1] \tag{10.2}$$

giving the quotient of the number of common edges in the TSP solutions $t_1$
and $t_2$ and the total number of edges. $E$ here denotes the set of edges in a
tour. The corresponding distance measure can then be defined as

$$d(t_1, t_2) = 1 - \mathrm{sim}(t_1, t_2) \in [0, 1] \tag{10.3}$$

Thus, the similarity or the distance of two concrete TSP solutions can be
measured on a linear scale between the values 0 and 1.
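As a small worked illustration of Formulas 10.2 and 10.3, the following sketch computes both measures for tours given as lists of city indices. It assumes a symmetric TSP, so that edges are undirected and the tour is closed; the function names are chosen for this example only.

```python
def tour_edges(tour):
    """Return the set of undirected edges of a closed tour."""
    n = len(tour)
    return {frozenset((tour[i], tour[(i + 1) % n])) for i in range(n)}

def similarity(t1, t2):
    """Share of edges that both tours have in common (Formula 10.2)."""
    e1, e2 = tour_edges(t1), tour_edges(t2)
    return len(e1 & e2) / len(e1)

def distance(t1, t2):
    """Distance as the complement of the similarity (Formula 10.3)."""
    return 1.0 - similarity(t1, t2)

# Two 5-city tours that share the edges {0,1}, {1,2}, and {3,4}.
a = [0, 1, 2, 3, 4]
b = [0, 1, 2, 4, 3]
print(similarity(a, b))  # 3 of 5 edges are shared
```

Because the edges are treated as undirected, a tour and its reversal have similarity 1, which matches the intuition that they describe the same round trip.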

A very detailed representation of genetic diversity in a population is the
graphical display of pairwise similarities or distances for all members of a
population. An appropriate visualization, which is provided in the
HeuristicLab framework, illustrates the similarities as an n × n matrix in
which each entry indicates the similarity in the form of a grey-scale value.
Figure 10.2 shows an example: the darker the (i, j)-th entry in the n × n grid
is, the more similar are the two solutions i and j. Not surprisingly, the
diagonal entries, which stand for the similarity of solution candidates with
themselves, are black, indicating maximum similarity.
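The n × n grid described above can be produced along the following lines. This is a sketch under assumptions, not HeuristicLab code: tours are lists of city indices, edges are undirected, and the 8-bit grey mapping in `to_grey` is just one possible choice.

```python
def tour_edges(tour):
    """Undirected edge set of a closed tour given as a list of city indices."""
    n = len(tour)
    return {frozenset((tour[i], tour[(i + 1) % n])) for i in range(n)}

def similarity_matrix(population):
    """n x n matrix of pairwise similarities (shared edges / edges per tour)."""
    edge_sets = [tour_edges(t) for t in population]
    return [[len(a & b) / len(a) for b in edge_sets] for a in edge_sets]

def to_grey(similarity):
    """Map similarity 1 to black (0) and similarity 0 to white (255)."""
    return round(255 * (1.0 - similarity))

population = [[0, 1, 2, 3, 4], [0, 1, 2, 4, 3], [0, 2, 4, 1, 3]]
grid = [[to_grey(s) for s in row] for row in similarity_matrix(population)]
```

The diagonal of `grid` is 0 (black), since every tour shares all of its edges with itself.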

Unfortunately, this representation is not very well suited for a static
monochrome figure. Therefore, the dynamics of this n × n color grid over the
generations is shown in numerous colored animations available at the website
of this book².

[Figure: Population Diversity Analysis (Iteration 10)]

FIGURE 10.2: Degree of similarity/distance for all pairs of solutions in an
SGA's population of 120 solution candidates after 10 generations.

For a meaningful representation of genetic diversity over time in a figure, it
is necessary to summarize the similarity/distance information of the entire
population in a single value. An average over all $\binom{n}{2}$ combinations
of solution pairs, in the form of a mean/max similarity value of the entire
population between 0 and 1, can be calculated according to Formulas 6.5 to 6.8
stated in Chapter 6. This form of representation allows genetic diversity to
be displayed over the generations in a single curve. Small values around 0
indicate low average similarity, i.e., high genetic diversity; conversely,
high similarity values of almost 1 indicate little genetic diversity in the
population. In the following we show results of exemplary test runs of GAs
applied to the 200-city TSP instance kroA200 taken from the TSPLIB, using the
parameter settings given in Table 10.1 and OX crossover.

Figures 10.3 and 10.4 show the genetic diversity curves over the generations
for a conventional standard genetic algorithm as well as for a typical
offspring selection GA. The grey-scale values in Figures 10.3, 10.4, and 10.5
show the progress of the mean similarity values of each individual (compared
to all others in the population); average similarity values are represented by
solid black lines.
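The two kinds of values plotted in these figures, the mean similarity of each individual to all others and the population average, might be computed as in the following sketch. It uses a plain arithmetic mean over all pairs; whether this coincides exactly with Formulas 6.5 to 6.8 is an assumption, as those formulas are not restated here.

```python
from itertools import combinations

def tour_edges(tour):
    """Undirected edge set of a closed tour given as a list of city indices."""
    n = len(tour)
    return {frozenset((tour[i], tour[(i + 1) % n])) for i in range(n)}

def mean_similarities(population):
    """Return (per-individual mean similarities, population mean).

    The population mean is the average over all n-choose-2 pairs.
    """
    n = len(population)
    edge_sets = [tour_edges(t) for t in population]
    totals = [0.0] * n
    for i, j in combinations(range(n), 2):
        s = len(edge_sets[i] & edge_sets[j]) / len(edge_sets[i])
        totals[i] += s
        totals[j] += s
    per_individual = [t / (n - 1) for t in totals]
    # Averaging the per-individual means equals averaging over all pairs,
    # because every pair contributes to exactly two individuals.
    return per_individual, sum(per_individual) / n
```

Calling this function once per generation and plotting the second return value over time yields a single diversity curve of the kind described above.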

For the standard GA it is observable that the similarity among the solution
candidates of a population increases very rapidly, causing little genetic
diversity already after a couple of generations; it is only mutation which is
responsible for reintroducing some new diversity and keeping the evolutionary
process going. As already explained in Chapter 7, without mutation the
standard GA tends to converge prematurely very rapidly.

² http://gagp2009.heuristiclab.com

[Figure: Population Diversity Analysis; x-axis from 0 to 10000]

FIGURE 10.3: Genetic diversity in the population of a conventional GA over
time.


[Figure: Population Diversity Analysis; x-axis: Iterations]

FIGURE 10.4: Genetic diversity of the population of a GA with offspring
selection over time.


Equipped with offspring selection, the results turn out to be completely
different: the average similarity in the population increases slowly and
steadily from 0 to 1. This means that the high degree of genetic diversity
which is available initially decreases very slowly and carefully, leading to a
state of convergence when no more diversity is available in the population;
the algorithm detects this state by reaching the maximum selection pressure
value (see Chapter 4). As already discussed in Chapter 7 by analyzing the
dynamics of the alleles of the globally optimal solution, the dynamics of
genetic diversity also show that an offspring selection GA hardly depends on
mutation and operates much closer to the general assumptions stated in the
building block hypothesis and the corresponding schema theorem than a
comparable conventional GA is able to do for general problem formulations.


[Figure: Population Diversity Analysis]

FIGURE 10.5: Genetic diversity of the entire population over time for a
SASEGASA with 5 subpopulations.


The analysis of genetic diversity over time for SASEGASA is shown in Figure
10.5. As for the diversity analyses of the conventional GA and the offspring
selection GA, SASEGASA has been applied to the 200-city benchmark TSP instance
kroA200, using the parameters given in Table 10.4 and the combination of the
crossover operators OX, ERX, and MPX. As we can see in Figure 10.5, the
genetic diversity of the entire population is still rather high at the first
reunification phase around iteration 2000, where the genetic diversity within
each single subpopulation is already lost. This means that even if there is no
more genetic diversity within each of the subpopulations, there is still quite
a lot of genetic diversity in the entire population. The individual
subpopulations must therefore have drifted to quite different regions of the
search space, which is consistent with the theoretical considerations of
Chapter 5. After each reunification step (the next one, from 4 to 3
subpopulations, is around iteration 4000) the average similarity (which is
inversely related to the average genetic diversity) stabilizes at a higher
level, indicating lower average genetic diversity after each reunification.
The iterations between the individual migration phases are responsible for
getting essential alleles (which are part of the global optimum or at least of
a high quality solution) fixed so that SASEGASA can operate beneficially; this
is in fact the case in our concrete example, as we also see in the results
stated in Table 10.8.


Summarizing these results, it can be stated for the TSP experiments that the
analysis of genetic diversity in the population confirms the results of
Chapter 7 without using any information about the concrete search space
topology. An illustration in the form of a static figure is certainly a
restriction when the dynamics of a system are to be observed. For that reason
the book's website contains additional material showing the dynamics of
pairwise similarities for all members of the population (as indicated in
Figure 10.2) in the form of short motion pictures.


10.2 Capacitated Vehicle Routing 


Similar to the practical study on the TSP, we have also applied several
algorithms to the capacitated vehicle routing problem, using several instances
of the Taillard benchmark set [Tai93]. This set consists of 14 instances with
75 to 385 cities, of which we picked the first two instances with 75 cities,
one with 100 cities, and one with 150 cities. There is no proven globally
optimal solution for these instances. Several authors, including Taillard
himself, have published best known solutions; a new best known solution for
one instance with 75 cities was discovered recently [AD06].

The instances were interpreted according to the definition of a CVRPTW as
presented in Chapter 8.2. Since the CVRP does not include time windows, but
only demands, artificial ready times, service times, and due dates have been
added such that the size of the time window is 216. This is large enough that
the time windows do not constrain the solution. Additionally, no maximum
number of vehicles is given; thus the number of available vehicles was set to
the number of customers, so that in the worst case every customer is serviced
by a separate vehicle. Since any additional vehicle would always remain
unused, our constraint on the maximum number of vehicles also does not
constrain the solution space.

The chosen representation is a path encoding with trip delimiters, similar
to the approach given in [PB96]. The number 0 represents the depot; all numbers
greater than 0 represent customers, which are visited in the order in which
they appear in the array when reading it from left to right. There are as many
0s in any array as there are vehicles, plus an additional 0 which terminates
the representation. Since all customers have to be visited and no customer can
be visited more than once, the encoding has a fixed length throughout the
optimization process. Unused vehicles are marked as empty routes and are
represented by two subsequent 0s. During crossover, each string is sorted so
that empty routes are listed after the last active vehicle has completed its
tour. There is, however, no specific order for active vehicles.
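To make the encoding concrete, the following sketch decodes such an array into one route per vehicle and re-encodes it with the empty routes moved to the end. The function names and the exact handling of the leading depot entry are assumptions made for this illustration.

```python
def decode_routes(encoding):
    """Split a 0-delimited path encoding into one customer list per vehicle."""
    if not encoding or encoding[0] != 0 or encoding[-1] != 0:
        raise ValueError("encoding must start and end with the depot (0)")
    routes, current = [], []
    for gene in encoding[1:]:
        if gene == 0:
            # A delimiter closes the current route; an immediately following
            # delimiter yields an empty route (unused vehicle).
            routes.append(current)
            current = []
        else:
            current.append(gene)
    return routes

def encode_routes(routes):
    """Re-encode routes, listing empty routes after all active vehicles."""
    active = [r for r in routes if r]
    empty = [r for r in routes if not r]
    encoding = [0]
    for route in active + empty:
        encoding.extend(route)
        encoding.append(0)
    return encoding

# Three vehicles, four customers; the second vehicle is unused.
encoded = [0, 3, 1, 0, 0, 2, 4, 0]
routes = decode_routes(encoded)     # [[3, 1], [], [2, 4]]
sorted_form = encode_routes(routes) # [0, 3, 1, 0, 2, 4, 0, 0]
```

The length of the array stays the same under this normalization, which reflects the fixed-length property of the encoding mentioned above.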


10.2.1 Results Achieved Using Standard Genetic 
Algorithms 


The genetic algorithm which we applied uses some of the operators described
in Chapter 8 and was applied in six different configurations, shown in Table
10.13. The algorithm has been applied five times per instance. By
experimentation we want to analyze the GA on the one hand using different
mutation operators and on the other hand using different selection operators.

Two main test scenarios are set up: the first one with lower selection
pressure using roulette wheel selection, and the second one with higher
selection pressure using tournament selection with a group size of three
(3-tournament). Both scenarios are tested with different settings for the
mutation operators, among them the previously described M1, M2, and LSM as
optimizing mutation operators, which aim to improve solutions using some
knowledge about the fitness function (the distance between the cities), as
well as non-optimizing mutation operators, which do not know about the fitness
function and therefore make unbiased choices. We group the mutation operators
within a single genetic algorithm and give them the same probability by
dividing the mutation rate by the number of mutation operators in the
particular test. The only exception is LSM, which has only a 0.0001% chance of
being selected due to its computational complexity.
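The way the mutation rate is shared among the grouped operators can be illustrated as follows. This is a sketch under assumptions: the function name `mutate`, the use of a single random draw, and the way the separate LSM probability is appended after the regular operators are choices made for the example, not details given in the text.

```python
import random

def mutate(individual, operators, mutation_rate=0.06, lsm=None,
           lsm_probability=0.0001 / 100, rng=random):
    """Apply at most one mutation operator to the individual.

    Each regular operator is chosen with probability
    mutation_rate / len(operators); the optional LSM operator is chosen
    with its own, much smaller probability.
    """
    r = rng.random()
    share = mutation_rate / len(operators)
    for index, operator in enumerate(operators):
        # Operator `index` owns the interval [index * share, (index + 1) * share).
        if r < (index + 1) * share:
            return operator(individual)
    # Assumption: LSM occupies a tiny extra interval after the regular operators.
    if lsm is not None and r < mutation_rate + lsm_probability:
        return lsm(individual)
    return individual
```

With the mutation rate of 0.06 from Table 10.13 and two operators such as M1 and M2, each of them would be applied to roughly 3% of the individuals.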

The fitness function is described in Chapter 8; it simply calculates the total
Euclidean distance traveled.
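For the path encoding with trip delimiters, such a fitness function can be written very compactly, because every 0 in the array stands for a visit to the depot. The following is a minimal sketch; the function name, the `coords` mapping, and the toy coordinates are assumptions for illustration, and capacity constraints are not checked here.

```python
import math

def total_distance(encoding, coords):
    """Sum of the Euclidean lengths of all legs in a 0-delimited encoding.

    coords maps each index to an (x, y) pair; index 0 is the depot.
    Two subsequent 0s (an unused vehicle) contribute a leg of length 0.
    """
    total = 0.0
    for a, b in zip(encoding, encoding[1:]):
        total += math.dist(coords[a], coords[b])
    return total

coords = {0: (0.0, 0.0), 1: (3.0, 4.0), 2: (0.0, 8.0)}
# Vehicle 1 serves customer 1, vehicle 2 serves customer 2, vehicle 3 is unused.
length = total_distance([0, 1, 0, 2, 0, 0], coords)
```

Note that `math.dist` requires Python 3.8 or later.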


10.2.1.1 Quality Progress of the Genetic Algorithm 


The GA is barely able to drive its population towards a high quality region.
Optimization mainly depends on the presence of 1-elitism, which preserves the
best found solution from generation to generation. Given this behavior it is
not surprising that local search methods like tabu search achieve good
performance on these problems, as the GA in this form is not able to exploit
much of the genetic information in the population; this happens especially
when selection pressure is lower, as we see for example in the results shown
for the SGA using roulette wheel selection. In Figure 10.6 we compare the
results according to the parent selection operators used. Using higher
selection pressure, the average and worst qualities are maintained at a
slightly better level when picking the best performing test for each. Still,
with both selection strategies the average and worst qualities do not improve
over the course of 2,000 generations. From the quality charts we are able to
see that the diversity is very high, which will become obvious again when we
take a look at the diversity progress.

Table 10.13: Parameter values used in the CVRP test runs applying a standard
GA.

Parameters for the SGA
(Results presented in Tab. 10.14)

Generations           2000
Population Size       200
Elitism Rate          1
Mutation Rate         0.06
Selection Operator    Roulette, 3-Tournament
Crossover Operators   {SBX, RBX}
Mutation Operators    {M1, M2}
                      {M1, M2, LSM}
                      {Relocate, Exchange, 2-Opt}


[Figure 10.6: four panels, each plotting quality values (vertical axis) over generations 0 to 2000 (horizontal axis).]


FIGURE 10.6: Quality progress of a standard GA using roulette wheel selection
on the left and 3-tournament selection on the right side, applied to instances
of the Taillard CVRP benchmark: tai75a (top) and tai75b (bottom).

From our observations it seems that the RBX is better suited to further
optimize the best solution. Analyzing one of the tests in more detail,
we see that RBX is responsible for approximately 75% of the evolutionary
cycles of selection, crossover, and mutation in which a new best solution was
found, whereas SBX is responsible for only 25% of the cases. We will find
similar behavior for the genetic algorithm using offspring selection, where the
SBX operator works better when the population is diverse and of worse
quality than when the population has converged and is of better quality. Both
operators also benefit from a local-search-like behavior in the repair procedure.
As already described in Chapter 8, many different approaches enhance the GA
with local search to treat the VRP.


10.2.1.2 Diversity Progress of the Genetic Algorithm 


In the diversity progress charts shown in Figure 10.7 we see how lower 
selection pressures leave individuals in the population which do not have 
many common edges, while there is higher mutual similarity when using 3- 
tournament selection for example. Similar individuals have several edges in 
common, and when these are crossed, the common edges will remain and other 
edges will be taken from either one of the two parents. Through selection by
fitness, the common edges accumulating in the population are those of good
quality. So, ideally the mutual similarity of the GA’s individuals should increase
slowly in order to go from exploration to exploitation. Thus, the algorithm
should start with a diverse population and end up with a highly similar
population in which each solution is either the optimal solution or has a quality
close to it.


The similarity measure for two VRP solutions $t_1$ and $t_2$ is calculated in
analogy to the TSP similarity using edgewise comparisons. However, as big
routes in the VRP are subdivided into smaller routes, a maximum similarity
$sim_{max}$ is calculated for each route $r \in t_1$ to all routes $s \in t_2$. These values
are summed for all routes $r$ and finally divided by the number of routes.
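
The route-wise similarity can be sketched in Python as follows. This is an illustration under assumptions rather than the implementation used for the experiments: routes are lists of customer indices, the depot (index 0) is added implicitly at both ends, edges are treated as undirected, and the similarity of a route $r$ to a route $s$ is taken as the share of the edges of $r$ that also occur in $s$. All function names are hypothetical.

```python
def route_edges(route, depot=0):
    """Undirected edges of one route, including the legs from and to the depot."""
    stops = [depot] + list(route) + [depot]
    return {frozenset(pair) for pair in zip(stops, stops[1:])}


def route_similarity(r, s):
    """Share of the edges of route r that also occur in route s (assumption)."""
    edges_r = route_edges(r)
    return len(edges_r & route_edges(s)) / len(edges_r)


def solution_similarity(t1, t2):
    """Average, over all routes r in t1, of the maximum similarity to any route s in t2."""
    return sum(max(route_similarity(r, s) for s in t2) for r in t1) / len(t1)


t1 = [[1, 2, 3], [4, 5]]
t2 = [[3, 2, 1], [4, 6], [5]]
similarity = solution_similarity(t1, t2)
```

In this example the first route of `t1` is found in reversed order in `t2` (similarity 1), while the second route shares only one of its three edges with its best match, so the overall similarity is 2/3. Note that the measure as sketched here is not symmetric in its two arguments.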


As we have already seen in the quality chart, the GA in this example is not
able to decrease the diversity over the course of the optimization. This can
nevertheless result in good solutions, as is shown when examining the final achieved
solution qualities in Table 10.14. Overall, the GA shows a behavior closer
to that of a trajectory-based approach than a population-based approach. In
a trajectory-based approach, there is only a single solution which is slightly
modified by mutation and accepted as new solution if some criteria are met.
One characteristic of trajectory-based approaches is their ability to exploit the
search space in local regions by finding local optima. As is the case with the
GA here, the best individual of a generation is saved to the next generation
and maintains a strong line of good quality genes.

FIGURE 10.7: Genetic diversity in the population of a GA with roulette
wheel selection (shown on the left side) and 3-tournament selection (shown
on the right side).


10.2.1.3 Empirical Results 


Results are listed in Table 10.14, showing the average of the best qualities
found and the standard deviation from the mean in percent as well as the
quality of the best found solution. These results show that the selection
pressure applied by roulette wheel selection was not enough to guarantee
good solution qualities; the average best quality is 20-30% worse
than the best known solution. Additionally, the quality values vary to a
greater degree, which also suggests that the results are not close to a potential
optimum.

When using higher selection pressure, for example by applying tournament
selection with larger group sizes, the GA is able to achieve very good average
best qualities on the two benchmark instances used here. The results are
around 1% worse than the best known solution in most of the cases, but
the GA was still not able to find the best known solution. Interestingly,
the choice of mutation operators matters more when the GA
performs worse, as is the case with roulette-wheel selection, than when it
performs better. In our example using 3-tournament selection the choice of
mutation operators is less important, and good results can be achieved using
both optimizing and nonoptimizing mutation operators.


Table 10.14: Results of a GA using roulette-wheel selection, 3-tournament
selection and various mutation operators.

Roulette Wheel Selection
Problem      Relocate/Exchange/2-Opt   M1/M2              M1/M2/LSM         Best Known
tai75a       1729.57 ± 1.65%           1665.86 ± 1.…%     1670.18 ± 0.98%   1618.36
Best found   1713.00                   1641.36            1654.62
tai75b       1396.37 ± 0.80%           1361.75 ± 0.…%     1365.54 ± 0.36%   1344.62
Best found   1387.64                   1352.05            1360.36

3-Tournament Selection
Problem      Relocate/Exchange/2-Opt   M1/M2              M1/M2/LSM         Best Known
tai75a       1635.23 ± 0.61%           1637.16 ± 0.…%     1634.29 ± 0.54%   1618.36
Best found   1622.66                   1619.22            1623.57
tai75b       1353.51 ± 0.17%           1358.12 ± 1.…%     1355.35 ± 0.23%   1344.62
Best found   1350.85                   1347.05            1352.02


A statistical comparison of the results of the GA with roulette wheel
selection and with 3-tournament selection shows the advantageous performance of
the GA with 3-tournament selection for these benchmark instances. A box
plot of the results is shown in Figure 10.8. We have compared these results
pairwise using a two-sided t-test: roulette-wheel selection on the one hand
and 3-tournament selection on the other, each time with the same mutation
operators. The hypothesis that the mean values of the results are equal
is rejected at a significance level of 0.05 in four out of the six comparisons.
As the means of the results achieved using 3-tournament selection are lower
than those achieved using roulette-wheel selection, we conclude that a higher
selection pressure is responsible for better performance.


10.2.2 Results Achieved Using Genetic Algorithms with 
Offspring Selection 


A genetic algorithm with offspring selection is quite successful insofar as
it can direct the whole population towards the global optimum or a solution
with a quality close to it. In this test the GA with OS does not make use of
parental selection operators, but randomly selects parents, crosses them, and
mutates the children with a certain probability. Accepted offspring individuals
must have a quality better than the best parent (the comparison factor is set
to 1). Infeasible solutions are penalized using high punishment factors; thus,
the algorithm, starting with a randomly created but feasible population, will
remain in the feasible region at all times. The parameters for these tests are
listed in Table 10.15.

FIGURE 10.8: Box plots of the qualities produced by a GA with roulette and
3-tournament selection, applied to the problem instances tai75a (top) and
tai75b (bottom).

Similar to the tests with the standard GA, several scenarios have been
selected and compared. The GA with offspring selection is applied to
75-customer CVRP instances using population sizes of 200 as well as 400, and also to
larger instances with population sizes of 500 and 1000. For each test scenario
the selection operators, crossover operators, and mutation rate are fixed, but
the mutation operators vary between nonoptimizing operators, for which we
have chosen relocate, exchange, and 2-Opt as in the GA tests, and
optimizing operators such as M1, M2, and LSM with the same considerations
as above; one test was done without mutation.

We have used two termination criteria that will stop the execution as soon
as one of them is satisfied. The first one is based on reaching the maximum
selection pressure barrier and the other one limits the number of evaluated
solutions to 400,000, which is the same number of evaluations the standard GA
has been given on these instances (2,000 generations with a population size
of 200). The GA with OS and a population size of
200 always terminates before the maximum number of evaluated solutions has
been reached, having evaluated around 250,000 to 300,000 solutions at the
end. Using a population size of 400 it has always terminated because it reached
the upper limit of 400,000 evaluated solutions prior to reaching maximum
selection pressure. The GA with OS and a population size of 500 is given a
maximum of 1,500,000 evaluations for the 100 customer problem and
2,000,000 for the 150 customer problem due to the greater complexity of the
instances it has to solve. A GA with offspring selection and a population size
of 1000 was also applied to the tai385 problem instance with a maximum of
10,000,000 evaluations.


Table 10.15: Parameter values used in CVRP test runs applying a GA with OS.

Parameters for the GA with OS (results presented in Tables 10.16 and 10.17)

Population Size              200, 400, 500
Elitism Rate                 1
Mutation Rate                0.06, 0.0
Selection Operator           Random
Crossover Operators          {SBX, RBX}
Mutation Operators           {M1, M2}
                             {M1, M2, LSM}
                             {Relocate, Exchange, 2-Opt}
Success Ratio                1
Comparison Factor Bounds     1-1
Maximum Selection Pressure   200

10.2.2.1 Improvement in Quality Progress with Offspring Selection 


A benefit of using offspring selection is the automatic adaptation of the
necessary selection pressure. Instead of choosing between roulette wheel, linear
rank or tournament selection, and an appropriate group size, it is feasible to
simply use random parent selection. The selection and reproduction phases
are repeated until the necessary number of individuals fulfilling the
success criterion has been generated. Thus, the algorithm will use less
selection pressure when the criterion is easily satisfied and will apply more selection
pressure when the criterion is harder to satisfy. Random selection allows
the worst individual to be selected as often as the best individual.
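
The following Python sketch illustrates how one generation of offspring selection with random parent selection, a success ratio of 1, and a comparison factor of 1 might look for a minimization problem. It is a simplified illustration and not the HeuristicLab implementation: the function `os_generation`, the toy bit-string problem, and the definition of the actual selection pressure as the number of generated offspring divided by the population size are assumptions made for this example.

```python
import random


def os_generation(population, fitness, crossover, mutate, mutation_rate,
                  max_selection_pressure, rng):
    """One generation of offspring selection (minimization, simplified sketch).

    Returns (accepted individuals, actual selection pressure, limit_hit).
    """
    size = len(population)
    successful = [min(population, key=fitness)]   # 1-elitism
    generated = 0
    while len(successful) < size:
        if generated >= max_selection_pressure * size:
            return successful, generated / size, True   # pressure barrier reached
        p1, p2 = rng.choice(population), rng.choice(population)  # random parents
        child = crossover(p1, p2, rng)
        if rng.random() < mutation_rate:
            child = mutate(child, rng)
        generated += 1
        # Comparison factor 1: the child must be better than the better parent.
        if fitness(child) < min(fitness(p1), fitness(p2)):
            successful.append(child)
    return successful, generated / size, False


# Toy problem (assumption): minimize the number of ones in a bit string.
fitness = sum


def crossover(a, b, rng):
    return [rng.choice(pair) for pair in zip(a, b)]


def mutate(x, rng):
    y = list(x)
    i = rng.randrange(len(y))
    y[i] = 1 - y[i]
    return y


rng = random.Random(1)
population = [[rng.randint(0, 1) for _ in range(16)] for _ in range(20)]
initial_best = min(map(fitness, population))
pressures = []
for _ in range(30):
    offspring, pressure, limit_hit = os_generation(
        population, fitness, crossover, mutate, 0.06, 200, rng)
    pressures.append(pressure)
    if limit_hit:          # first termination criterion described in the text
        break
    population = offspring
final_best = min(map(fitness, population))
```

The list `pressures` shows the effect described above: few offspring have to be generated while improvements are easy to find, and the effort per generation grows as the population converges.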

The GA with OS is quite successful even without mutation; this is what
we had expected given the analyses in Chapter 7. Figure 10.9 shows the
quality progress of the offspring selection GA. The number of generations
is fairly low compared to a conventional genetic algorithm, but more work
is done per generation. The curves show the typical behavior of a GA with
OS where worst, average, and best quality values converge at the end of
the evolutionary process. The result is a highly similar population of good
quality; at this point genetic diversity, as we will see below, has mostly been
lost. So, the algorithm cannot proceed further to create new better solutions
and terminates. Using a larger population size such as 400 in our case, but
with the same limit on the number of evaluated solutions, the algorithm terminates before
it reduces the genetic diversity to the point where no further better solution
can be created. Nevertheless, the GA with OS finds better solutions, as the
larger population can hold more genetic information and as it uses
about 100,000 evaluations more than with a population size of 200.


[Figure 10.9: four panels, each plotting quality values (vertical axis) over generations (horizontal axis).]

FIGURE 10.9: Quality progress of the offspring selection GA for the instances
(from top to bottom) tai75a and tai75b. The left column shows the progress
with a population size of 200, while in the right column the GA with offspring
selection uses a population size of 400.


In Figure 10.10 the influence of the crossover operators in each generation
is shown. It shows, in percent, how many of the offspring in each generation
were created by SBX and how many by RBX. The higher the values, the more
frequently an operator was able to create offspring that exceed the
quality of the best parent in the GA with OS here. It can be seen that SBX
is initially able to produce slightly more successful children than RBX, but
as the population converges and improves in quality, RBX produces better
offspring to a higher degree.

FIGURE 10.10: Influence of the crossover operators SBX and RBX on each
generation of an offspring selection algorithm. The lighter line represents the
RBX; the darker line represents the SBX.


10.2.2.2 Improved Diversity Progress with Offspring Selection 


The diversity progress shown in Figure 10.11 is similar to what has been
observed for the TSP. The GA with OS starts with the same diverse initial
population that the GA starts with, but is able to slowly spread the good
genetic information among the population so that in the end the similarity
of the individuals rises and the algorithm progresses from exploration to
exploitation. At the end of the search, the mutual similarity is close to 1, so almost
all the individuals in the population share the same edges. This behavior has
already been analyzed in Chapter 7. The results are slightly different when
using a larger population size. The algorithm finishes before it can reduce the
genetic diversity in the population and thus the diversity progress looks cut
off. Nevertheless, as we will see in the next section, the results are improved.
Since the GA with OS and a population size of 400 has room for further
optimization as there is still enough diversity, allowing more evaluations could
yield even better results.


10.2.2.3 Empirical Results 


Results show a very sound performance of the GA using offspring selection:
It is able to get very close to the optimum and to find it in many more
cases than the GA without offspring selection. Increasing the population size
to 400 individuals allowed offspring selection to find the best known solution
much more often. The only exception is the tai75b instance, where the best
known solution is not found that easily; it seems that it also requires a bit of
luck. Finding the “2nd best known solution”, which had been the best known
solution for a while, is considerably easier: In 14 out of 20 runs the GA with
OS and a population size of 400 was able to find it, but it found the currently
best known solution in only a single run out of 20. It is possible that
the best known solution does not lie within an attracting region for the GA,
which is probably also the reason for its late discovery in [AD06]. Regarding
solution quality, the currently best known solution has a quality of 1344.618 while
the “2nd best known solution” has a quality of 1344.637.

FIGURE 10.11: Genetic diversity in the population of a GA with offspring
selection and a population size of 200 on the left and 400 on the right for the
problem instances tai75a and tai75b (from top to bottom).


Analyzing the results reported in Table 10.16 we see that the choice of the
mutation operator in our genetic algorithm using offspring selection is again of
less importance. The best results are computed using nonoptimizing mutation
operators as well as a combination of M1 and M2 with local search. Omitting
mutation leads to good results in general with average best solution qualities
close to the best known solution qualities.

Table 10.16: Results of a GA with offspring selection and population sizes of
200 and 400 and various mutation operators. The configuration is listed in
Table 10.15.

Population Size 200
Problem      Relocate/Exchange/2-Opt   M1/M2             M1/M2/LSM         No Mutation       Best Known
tai75a       1620.26 ± 0.16%           1622.03 ± 0.15%   1622.48 ± 0.06%   1622.72 ± 0.07%   1618.36
Best found   1618.36                   1618.36           1621.96           1621.95
tai75b       1346.26 ± 0.24%           1345.78 ± 0.06%   1345.85 ± 0.12%   1345.71 ± 0.16%   1344.62
Best found   1344.64                   1344.64           1344.64           1344.64

Population Size 400
Problem      Relocate/Exchange/2-Opt   M1/M2             M1/M2/LSM         No Mutation       Best Known
tai75a       1620.68 ± 0.11%           1620.52 ± 0.18%   1618.73 ± 0.03%   1621.02 ± 0.13%   1618.36
Best found   1618.71                   1618.36           1618.36           1618.36
tai75b       1344.67 ± 0.00%           1344.64 ± 0.00%   1344.63 ± 0.00%   1344.83 ± 0.03%   1344.62
Best found   1344.64                   1344.64           1344.62           1344.64


From the results we can also see that the GA with OS benefits from a larger
population size insofar as it gets closer to the best known solution
on average and finds it more often. Given the small number of replications,
however, no statistically significant conclusion can be drawn; still, as the box plots in
Figure 10.12 show, using a larger population size results in more robust tests
with smaller standard deviations of the results’ qualities as well as quality
values closer to that of the best known solutions. This is not surprising, as it
has been discussed that a larger initial population is more likely to hold all the
relevant alleles which are to be identified and assembled in a single solution
during the optimization process. A larger population can hold more diverse
solutions, which prevents important alleles from disappearing. Naturally, it
takes more effort for a larger population to converge and thus the number of
evaluated solutions increases with the population size. Population size in an
offspring selection genetic algorithm is a tradeoff between achievable quality
and effort; in a traditional GA, increasing the population size has a similar
effect only when the parent selection pressure is increased accordingly. This
may for example be achieved by using tournament selection with an increased
tournament group size.


The results returned by the standard GA and the GA with OS are compared
in Figure 10.13, which shows the box plots of the results’ qualities of these two
GA variants. Here we see that the results of the GA using offspring selection
are generally more robust insofar as they are of good quality and do not spread
as much as the results returned by the standard GA using 3-tournament
selection. Again, a pairwise two-sided t-test of the results computed with the
standard GA compared to the offspring selection GA rejected the hypothesis
that the means of these results are equal at a significance level of 0.05. As the
means of the offspring selection GA are lower than those of the standard GA, it is
reasonable to assume that the offspring selection GA indeed performs better than
the standard GA.

FIGURE 10.12: Box plots of the offspring selection GA with a population
size of 200 and 400 for the instances tai75a and tai75b.

FIGURE 10.13: Box plots of the GA with 3-tournament selection against the
offspring selection GA for the instances tai75a (shown in the upper part) and
tai75b (shown in the lower part).


We have also applied a GA with offspring selection to more complex
problem instances, specifically one with 100 and one with 150 customers as
well as the most complex instance with 385 customers. The algorithm is
well suited to getting very close to the best known solution in this configuration,
though it is likely that the population size still needs to be increased, as the best
known solution could not be reached in any of the test runs. The best solution
found for the tai100a instance has a quality of 2062.25 and is about 1% worse
than the currently best known solution; the offspring selection GA achieved
average best qualities 1-2% worse than the currently best known solution. In
all cases except two it finished before reaching the maximum number of
evaluated solutions, having evaluated on average 1.2 to 1.3 million solutions.
For the tai150a instance the algorithm finished having evaluated on average
1.9 million solutions; some runs, however, ran into the maximum of 2 million
solution evaluations. The best solution in 5 test runs has a quality of 3068.04
and is 0.42% worse than the best known one; the algorithm also achieves average best
qualities approximately 1% worse than the best known solution. The results
are given in Table 10.17.

For the tai385 problem instance, which is the largest instance in Taillard’s 
benchmark set, a good result could be achieved as well. Here the customers 
are modeled according to the locations of the most important towns or villages 
in the smallest political entities in the Swiss canton of Vaud. The demand is 
modeled proportional to the number of inhabitants living there [Tai93]. Using 
an offspring selection GA with a population size of 1000 without mutation, 
the final tour length found is 25,498.40 after 10 million evaluated solutions. 
This is 4.37% higher than the currently best known solution with a tour 
length of 24,431.44. It is likely that better results can be achieved with even 
higher population sizes, a point where parallelization becomes more and more 
important in order to achieve results in adequate time. 
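
The relative gaps quoted in this section can be reproduced with a short calculation. The sketch below uses only values stated in the text and in Table 10.17; the helper name `gap_percent` is hypothetical.

```python
def gap_percent(found, best_known):
    """Relative excess of a found tour length over the best known one, in percent."""
    return 100.0 * (found - best_known) / best_known


# (best found, best known) as reported for the offspring selection GA
results = {
    "tai100a": (2062.25, 2041.34),
    "tai150a": (3068.04, 3055.23),
    "tai385": (25498.40, 24431.44),
}
gaps = {name: round(gap_percent(found, best), 2)
        for name, (found, best) in results.items()}
```

The resulting gaps are about 1.02% for tai100a, 0.42% for tai150a, and 4.37% for tai385, in line with the figures given above.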


Table 10.17: Results of a GA with offspring selection and a population size
of 500 and various mutation operators. The configuration is listed in Table
10.15.

Population Size 500
Problem      Relocate/Exchange/2-Opt   M1/M2             M1/M2/LSM         No Mutation       Best Known
tai100a      2081.30 ± 0.22%           2077.89 ± 0.48%   2079.99 ± 0.21%   2078.60 ± 0.26%   2041.34
Best found   2074.56                   2062.25           2073.22           2073.55
tai150a      3082.48 ± 0.37%           3078.79 ± 0.23%   3087.86 ± 0.39%   3086.44 ± 0.48%   3055.23
Best found   3068.04                   3071.54           3074.44           3068.54


Chapter 11 


Data-Based Modeling with Genetic 
Programming 


11.1 Time Series Analysis 


Whenever (input or output) data of any kind of system are recorded over
time and compiled in data collections as sequences of data points, these
sequences are called time series; typically, these data points are recorded at
time intervals which are often, but not always, uniform.


The collection of methods and approaches used for trying to
understand the underlying mechanisms that are documented in time series is
called time series analysis. We do not only want to know what produced
the data; we are also interested in predicting future values, i.e.,
we want to develop models that can be used as predictors for the system at
hand.


There is a lot of literature on theory and different approaches to time series
analysis. One of the most famous approaches is the so-called Box-Jenkins
approach as described, e.g., in [BJ76] and [And76], which includes separate model
identification, parameter estimation, and model checking steps. Detailed
discussions of other methods and their mathematical and statistical background
can be found for example in [And71], [Ken73], [Pan83], [KO90], [Pan91],
[BD91], [Ham94], and [BD96]; more recent research and applications are for
example given in [PTT01], [Cha01], [Dei04], [Wei06], and [MJK07].


The main principle can be formulated in the following way: For a given
target time series $T$ storing the values $T_{(1)}, \ldots, T_{(n)}$ and a given set of variables
$X_1, \ldots, X_N$ we search for a model $f$ that describes $T$ as

$$T_{(t)} = f(X_{1(t)}, X_{1(t-1)}, \ldots, X_{1(t-t_{max})}, \ldots, X_{N(t)}, X_{N(t-1)}, \ldots, X_{N(t-t_{max})}) + \epsilon_t$$

where $t_{max}$ is the maximum number of past values, and $\epsilon_t$ is an error term.
If the target variable’s values are also allowed to be considered, then a so-called
autoregressive part is added and we search for a model $f$ such that

$$T_{(t)} = f(X_{1(t)}, X_{1(t-1)}, \ldots, X_{1(t-t_{max})}, \ldots, X_{N(t)}, X_{N(t-1)}, \ldots, X_{N(t-t_{max})}, T_{(t-1)}, \ldots, T_{(t-t_{max})}) + \epsilon_t$$
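
To make the structure of these argument lists concrete, the following Python sketch collects the values that a model $f$ may access at a time index $t$. It is only an illustration: the function name `lagged_inputs` is hypothetical, indices are 0-based (unlike the 1-based notation above), and a GP model would typically use only a subset of these values.

```python
def lagged_inputs(inputs, t, t_max, target=None):
    """Argument list of f at time index t (0-based).

    inputs : list of input series X_1 ... X_N (equal length)
    target : optional target series T; if given, the autoregressive
             part T(t-1) ... T(t-t_max) is appended.
    """
    if t < t_max:
        raise ValueError("not enough past values available at index t")
    args = []
    for series in inputs:
        # X_i(t), X_i(t-1), ..., X_i(t-t_max)
        args.extend(series[t - lag] for lag in range(t_max + 1))
    if target is not None:
        # T(t-1), ..., T(t-t_max); T(t) itself is what the model has to predict
        args.extend(target[t - lag] for lag in range(1, t_max + 1))
    return args


x1 = [1, 2, 3, 4]
x2 = [10, 20, 30, 40]
target = [5, 6, 7, 8]
plain = lagged_inputs([x1, x2], t=3, t_max=2)
autoregressive = lagged_inputs([x1, x2], t=3, t_max=2, target=target)
```

For the toy series above, `plain` holds the current and the two previous values of both inputs, and `autoregressive` additionally holds the two previous target values.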


Of course, the field of applications of time series analysis is huge and
includes for example astronomy, sociology, economics, or, which is what we are
going to do in the course of the application examples given in this section,
the analysis of physical systems. It is, however, not at all natural that any
physical system, be it technical or not, can be represented by a simple
and easily understandable model. In this context the authors strongly
recommend reading Eugene P. Wigner’s article “The Unreasonable Effectiveness of
Mathematics in the Natural Sciences” [Wig60]. In this article Wigner points
out that, although so many natural phenomena such as, e.g., gravitation or
planetary motion can be described by astoundingly simple equations, it is not
at all natural that “laws of nature” exist and even much less that man is able
to discover them.

Especially in the context of analyzing physical systems, the models which 
are to be created for describing a system can be seen as so-called virtual 
sensors: The goal is to develop models of sufficient quality so that these 
models (functions) can be used instead of real sensors, i.e., they are virtual 
sensors. Of course, these virtual sensors can be used in various ways, for 
example also in addition to real sensors enabling fault detection. 

In this section we will concentrate on time series analysis with genetic
programming: GP is used for evolving models that describe target time series
using other data time series collections. In principle, we use the GP
methods for structure identification described in the previous sections, but
some time series specific details are to be described here, especially a time
series specific evaluation operator described in Section 11.1.1. Test results are
given and discussed in Section 11.1.2.


11.1.1 Time Series Specific Evaluation 


In principle there is no reason why one should not use mean squared errors
or any other of the evaluation functions presented in Section 9.2.3.3 for
evaluating time series models produced by GP. Still, in time series we do not only
want to produce models that approximate the given target values, but also
the dynamics of the underlying system that are represented in the measured
data. Thus, we also want to estimate a model’s quality with respect to the
local changes in the data as well as the accumulated values.

This can be done by calculating the differential and integral values. For a
given time series $x$, the differential of order $o$ is defined as $diff(x, o)$ and the
integral as $int(x)$:

$$diff(x, o)_i = x_i - x_{i-o} \tag{11.1}$$
$$int(x)_i = \sum_{j=1}^{i} x_j \tag{11.2}$$

for each index $i \in [1; |x|]$.
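
Both operations are easy to express in code. The following Python sketch is an illustration only; it uses 0-based indices, names the integral function `integral` to avoid shadowing Python's built-in `int`, and assumes that the differential is computed only for indices where the value $x_{i-o}$ exists (the text does not state how the first $o$ positions are treated).

```python
from itertools import accumulate


def diff(x, o):
    """Differential of order o: x[i] - x[i - o] wherever x[i - o] exists."""
    return [x[i] - x[i - o] for i in range(o, len(x))]


def integral(x):
    """Running sum: the i-th value is x[0] + ... + x[i]."""
    return list(accumulate(x))


x = [1, 4, 9, 16]
d1 = diff(x, 1)       # local changes between neighboring samples
d2 = diff(x, 2)       # changes over two time steps
acc = integral(x)     # accumulated values
```

For the series 1, 4, 9, 16 the first-order differential is 3, 5, 7 and the integral is 1, 5, 14, 30.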

For evaluating a time series model $m$ on the basis of target values $\mathbf{o}$ we
calculate all respective values $\mathbf{e}$ by evaluating $m$ and then calculate the
combined fitness values (as described in Section 9.2.5.2) for the plain values, the
differential (of a predefined order $o$), and the integral values. These partial
results are weighted using the coefficients $c_1$, $c_2$, and $c_3$, and the final result
is calculated in the following way:

$$a_1 = COMB(\mathbf{o}, \mathbf{e}, n, c_{plain}) \tag{11.3}$$
$$a_2 = COMB(diff(\mathbf{o}, o), diff(\mathbf{e}, o), n, c_{diff}) \tag{11.4}$$
$$a_3 = COMB(int(\mathbf{o}), int(\mathbf{e}), n, c_{int}) \tag{11.5}$$
$$TS(\mathbf{o}, \mathbf{e}, o, n, c_{plain}, c_{diff}, c_{int}, c_1, c_2, c_3) = \frac{\sum_{i=1}^{3} a_i \cdot c_i}{\sum_{i=1}^{3} c_i} \tag{11.6}$$

with $c_{plain}$, $c_{diff}$, and $c_{int}$ being the coefficients needed by the combined
evaluation function for weighting the partial MSE, VAF, and $R^2$ results as
well as the maximum negative and positive errors.

Of course, early stopping of model evaluations as described in Section 9.2.5.5 
is also possible for this time series evaluation function. 
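
A minimal Python sketch of this weighted combination is given below. It is not the evaluation operator used in the experiments: the combined evaluation function COMB of Section 9.2.5.2 is not reproduced here, so the mean squared error serves as a stand-in, and the names `ts_fitness`, `mse`, and `order` are assumptions made for this illustration.

```python
from itertools import accumulate


def mse(o, e):
    """Mean squared error; stand-in for the combined evaluation COMB."""
    return sum((a - b) ** 2 for a, b in zip(o, e)) / len(o)


def diff(x, order):
    """Differential of the given order (only where x[i - order] exists)."""
    return [x[i] - x[i - order] for i in range(order, len(x))]


def ts_fitness(o, e, order, c1, c2, c3, comb=mse):
    """Weighted average of the plain, differential, and integral partial results."""
    a1 = comb(o, e)
    a2 = comb(diff(o, order), diff(e, order))
    a3 = comb(list(accumulate(o)), list(accumulate(e)))
    return (a1 * c1 + a2 * c2 + a3 * c3) / (c1 + c2 + c3)


targets = [1, 2, 3, 4]
estimates = [1, 2, 3, 5]
equal_weights = ts_fitness(targets, estimates, 1, 1, 1, 1)
plain_only = ts_fitness(targets, estimates, 1, 1, 0, 0)
```

Setting $c_2 = c_3 = 0$ reduces the measure to the plain evaluation, whereas larger values of $c_2$ and $c_3$ put more emphasis on matching local changes and accumulated values, respectively.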


11.1.2 Application Example: Design of Virtual Sensors for 
Emissions of Diesel Engines 


The first research work of members of the Heuristic and Evolutionary
Algorithms Laboratory (HEAL) in the area of system identification using GP
was done in cooperation with the Institute for Design and Control of
Mechatronical Systems (DesCon) at JKU Linz, Austria. The framework and the
main infrastructure were provided by DesCon, who maintain a dynamical motor
test bench (manufactured by AVL, Graz, Austria) shown in Figure 11.1. A
BMW diesel motor is installed on this test bench, and many parameters
of the ECU (engine control unit) as well as engine parameters and emissions
are measured; for example, air mass flows, temperatures, and boost pressure
values are measured, nitric oxides (NOx, to be described later) are measured
using a Horiba Mexa 7000 combustion analyzer, and an opacimeter is used
for estimating the opacity of the engine’s emissions (in order to measure the
emission of particulate matter, i.e., soot).




FIGURE 11.1: Dynamic diesel engine test bench at the Institute for Design 
and Control of Mechatronical Systems, JKU Linz. 


During several years of research on the identification of NOx and soot
emissions, members of DesCon have tried several modeling approaches, some of
them being purely data-based, as for example those using artificial neural
networks (ANNs). Due to rather unsatisfactory results obtained using ANNs, the
ability of GP to produce reasonable models was investigated in pilot studies;
we are here once again thankful to Prof. del Re for initiating these studies.

In this context, our goal is to use system identification approaches in order
to create models that are designed to replace or support physical sensors; we
want to have models that can potentially be used instead of these physical
sensors (which can be easily damaged or simply expensive). This is why we are
here dealing with the design of so-called virtual sensors.


11.1.2.1 Designing Virtual Sensors for Nitric Oxides (NOx)


In general, being able to predict NOx emissions on-line (i.e., during engine
operation) would be very helpful for low emissions engine control. While NOx
formation is widely understood (see for example [dRLF+05] and the references
given therein), the computation of NOx turns out to be too complex and,
at the moment, not easy to use for control. The reason for this is
that in theory it would be possible to calculate the engine’s NOx emissions
if all relevant parameters (pressures, temperatures, ...) of the combustion
chambers were known, but (at least at the moment) we are not able to measure
all these values.


As already mentioned above, ANNs have been used for data-based modeling
of NOx emissions of a BMW diesel engine. These results were not very
satisfying, as is for example documented in [dRLF+05]: Even though
modeling quality on training data was very good, the model’s ability to predict
correct values for operating points not included in the training data was very
poor.

We therefore designed and implemented a first GP approach based on
HeuristicLab 1.0; preliminary results were published in [WAW04a] and
[WAW04b]. In [WAW04b] we documented the ability of GP using offspring
selection to produce reasonable models for NOx, including extensive statistics
showing that the results obtained applying rigid offspring selection were
significantly better than those obtained without using OS or even OS with less
strict parameter settings, i.e., lower success ratio and comparison factor
parameters.

NOx values were recorded by DesCon members following the standard procedure 
defined by the Federal Test Procedure (FTP); a whole standardized test run 
is therefore called an FTP cycle. FTP tests were executed on the DesCon test 
bench in two different ways as it is possible to activate or to deactivate 
exhaust gas recirculation (EGR). In principle, recirculating a portion of an 
engine's exhaust gas back to the engine cylinders is called EGR; the incoming 
air mixes with recirculated exhaust gas, which lowers the adiabatic flame 
temperature and reduces the amount of excess oxygen (at least in diesel 
engines). Furthermore, the peak combustion temperature is decreased; since 
the formation of NOx progresses much faster at high temperatures, EGR can 
also be used for decreasing the generation of NOx. Further information about 
EGR and its effects on the formation of NOx can for example be found in 
[Hey88] and [vBS04]. 

We shall therefore here take a closer look at the following two modeling 
tasks: 


• Situation (1): Use data recorded with deactivated EGR; 
• Situation (2): Use data recorded with activated EGR. 


In both cases the data were recorded at 20 Hz; the execution of the cycles 
took approximately 23 minutes. In total, 33 variables were recorded; here we 
do not give a complete overview of the statistical parameters of these 
variables but rather restrict ourselves to the linear correlation of the input 
variables with the target variable: All linear correlations¹ of the potential 
input variables and the target variable NOx are summarized in Table 11.1; all 
variables were filtered using a median filter of order five² before calculating 
the correlation coefficients. 
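
The preprocessing described above can be sketched as follows. This is only a minimal illustration and not the implementation used for the tests reported here: we assume that the standard formula referred to is the Pearson correlation coefficient, that the filter window is centered, and that it is simply truncated at the borders of the series; the function names and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, median


def median_filter(values, order=5):
    """Replace each sample by the median of a centered window of `order` samples.

    Assumption: near the borders the window is truncated.
    """
    half = order // 2
    filtered = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        filtered.append(median(window))
    return filtered


def linear_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical input series containing one outlier (50.0) and a target series.
signal = [1.0, 2.0, 50.0, 3.0, 4.0, 5.0, 6.0]
target = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]

smoothed = median_filter(signal)
print(smoothed)
print(linear_correlation(smoothed, target))
```

The outlier is removed by the filter before the correlation is calculated, which is the reason for filtering first.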


1 We here use the same standard formula for calculating linear correlation coefficients of 
time series as described in Section 9.4.1. 

2 Applying a median filter means that a moving window is shifted over the data and all 
samples are replaced by the median value of their respective data environment. For 
calculating the filtered value $y_i$ using median filtering of order 5 we collect the original values 
Table 11.1: Linear correlation of input variables and the target values (NOx) 
in the NOx data set I. 


Variable | Correlation Coefficient         | Variable | Correlation Coefficient 
         | Situation (1) | Situation (2)   |          | Situation (1) | Situation (2) 


time [0141 [ _-0.129 [alpha EE 0-462 
72 [arr [0a COR E 0.259 
OT. L000 WV ALS 0.853 

mTorr | — onr) — oss | memes | — 0708] 0-416 

MEMES? | — omoj _____0.488_]|_ MEMES | — 00r] 0.054 

MEMES | — ooo] —— 0000| MEMES | — 0027] -0007 

MEMES | — ooo] —— 000o | MEMES | — 0072] 0-015 

ME-MESS_ | — om2) 0.401 ]| MEMES | osoo] 0:592 

ME-MESIO | 0135| 0133 | MEMES | — 04%] 0:532 

ME-MES2 | — 023| — 032 | MEMES | —— 0052] 0-376 

MEMES] — oo| —— 010| MEMES | —— 0267] 0.314 

ME-MESIO | 0.392 [| ____0.478 | MEMES | 0738] 0.470 

N-MOTOR | ____0.404 [0.413 | OPA_OPAC [___0.248 | 0.419 

T-EXH 08a | aE TINK [ TSS 0-004 

TILVR | — osi) ___0.315_| T-O7L____| oor] KEEN 


THO.V RK i -0.077 0.205 [| TWA i 0.149 0.064 


Obviously, activating EGR significantly increases the correlation of NOx 
and all exhaust variables such as CO2 or THC. 

The next question is whether to incorporate gas emissions such as CO2 in the 
modeling process; of course, estimating NOx is a lot easier if CO2 is known 
since there is a high correlation (especially when EGR is activated), but NOx 
models that do not need CO2 information are more useful as they can be 
applied without having to measure other emission values. Furthermore, we 
also excluded the variables alpha, COH, COL, THC, M_TO1F, ME_MES01 to 07, 
ME_MES10, ME_MES14, and ME_MES17 from the set of valid input variables 
for building models that do not incorporate exhaust information. 

We applied GP using populations of 700 individuals for modeling the measured 
NOx data; 1-elitism was applied, the mutation rate was set to 0.07, 
and rigid offspring selection was applied (maximum selection pressure: 300). 
The first 3,000 samples (representing 2.5 minutes) of the data sets were 
discarded; in strategy (1) the samples 3,001 to 10,000 were used as training 
data, in strategy (2) the samples 3,001 to 13,000. The rest of the data was 
used as validation / test samples. 
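
As a small worked example of this partitioning, the following sketch derives the index ranges from the sampling rate. The total number of samples is only estimated from the approximate cycle duration of 23 minutes, and the helper name is hypothetical.

```python
SAMPLE_RATE_HZ = 20   # recording frequency of the data set
SKIPPED = 3000        # leading samples that are not used at all


def split_indices(n_samples, train_end, skipped=SKIPPED):
    """Return (training, test) ranges as 0-based, half-open index ranges.

    Samples 3,001 to train_end (1-based) become indices skipped..train_end-1.
    """
    training = range(skipped, train_end)
    test = range(train_end, n_samples)
    return training, test


# Assumption: about 23 minutes at 20 Hz, i.e., roughly 27,600 samples.
n_samples = 23 * 60 * SAMPLE_RATE_HZ

train1, test1 = split_indices(n_samples, 10_000)   # strategy (1)
train2, test2 = split_indices(n_samples, 13_000)   # strategy (2)

print(SKIPPED / SAMPLE_RATE_HZ / 60)   # duration of the skipped part in minutes
print(len(train1), len(test1))
print(len(train2), len(test2))
```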

Amongst other tests, we attacked modeling situation (1) without using ex- 
haust information (hereafter called test strategy (1)), and modeling situation 
(2) using exhaust information (test strategy (2)); both test strategies were ex- 
ecuted 5 times independently leading to the mean squared errors on training 
data summarized in Table 11.2. 


$x_{i-2}$, $x_{i-1}$, $x_i$, $x_{i+1}$, and $x_{i+2}$; after sorting these values we get $x'_j$ for $j \in [1,5]$ with 
$x'_j \le x'_{j+1}$ for $j \in [1,4]$. $y_i$ is then set to the median value of $x'$, i.e., $y_i = x'_3$. 
Table 11.2: Mean squared errors on training data for the NOx data set I. 


          Test Strategy (1) | Test Strategy (2) 
Average | 49.867 
Minimum | 43.408 


Let us now have a closer look at the best models (with respect to training 
data) produced for these test scenarios; their evaluations are displayed 
in Figures 11.2 and 11.3, respectively. 
NOx_vK(t) = ((([0,950183*T_OEL(t-21)]*(([1,145998*T_LLVK(t-38)]-[1,023461*ME_MES17(t-28)])/([1,065444*ME_MES10(t-18)]+[0,951174*N_MOTOR(t-20)])))-((([0,825300* 
ME_MES9(t-27)]+([0,871124*ME_MES9(t-22)]/[0,933514*ME_MES9(t-13)]))/([1,006088*N_MOTOR(t-21)]/[0,991764*ME_MES9(t-26)]))*((([0,891293* 
ME_MES9(t-15)]/[0,705167*N_MOTOR(t-26)])/([0,947497*N_MOTOR(t-25)]/[1,136678*ME_MES11(t-8)]))/(([1,082096*N_MOTOR(t-1)]#[1,286062*T_LLVK(t-37)])/ 
-9,654))))+((([1,205990*T_OEL(t-30)]+[0,937356*T_OEL(t-10)])*([1,065581*T_LLVK(t-40)]/([0,928334*T_OEL(t-6)]+[1,184806*ME_MES7(t-12)])))-([0,888947* 
ME_MES9(t-14)]*((([1,181614*ME_MES9(t-28)]/[0,666162*N_MOTOR(t-25)])/([0,631016*N_MOTOR(t-24)]/[1,301366*ME_MES11(t)]))/(([0,867211* 
N_MOTOR(t-40)]+[1,049105*T_LLVK(t-34)])-7,628))))) 


FIGURE 11.2: Evaluation of the best model produced by GP for test strategy 
(1). 


The best model for test strategy (1) has a worse fit on test data 
($mse_{test}(best_1) = 60.636$ in contrast to $mse_{training}(best_1) = 43.408$); the 
best model for test strategy (2) surprisingly even has a better fit on test data 
($mse_{test}(best_2) = 5.809$) than on training data ($mse_{training}(best_2) = 11.259$). 

We also tested standard GP without offspring selection, but with propor- 
tional as well as tournament (k = 3) parent selection, 1000 individuals, 2000 
iterations, 7% mutation rate and the same data base as the one described 
previously. 
NOx_vK(t) = (((({0,838023°CO2_vK(t-6)}-((0,985178*T_LLVK(t-3)}9,329))+(({1,096131*ME_MES13(t-14)}-9,835)/((1,075515°ME_MES13(t-2)}+1,117)))+(((9,269-(1,084379"THC_vK(t-8)))- 
{(1,033333"T_ABGAS(t-13)}+[COL_vK(t-17)))/((-7,819+[0,908894°N_ MOTOR(t-14))*{1,216018*ME_MES13(t-15)))))-((0,768676"CO2_vK(t-10)}-[0,899741* 
ME_MES13(t-3)})((((1,089508°ME_MES3(t-9)}((1,200488°N_MOTOR(t-9)}*[1,150031°ME_MES9(t-6)))+((0,724729"CO2_vK(t-4)]+-1,733))+8,329))) 


FIGURE 11.3: Evaluation of the best model produced by GP for test strategy 
(2). 


The use of proportional selection in particular did not yield reasonable 
results: The evaluation of the best model for test strategy (1) returned a 
mean squared error of 110.23 on training data, and for the best model for test 
strategy (2) the mean squared error was 21.34. The results obtained using 
tournament selection, which is suggested in GP literature (as for example in 
[KKS+03a] or [LP02]), were a lot better, but still not as good as those 
produced by extended GP: The best model for test strategy (1) showed a mean 
squared error of 61.92 on training data, and the best model for test strategy 
(2) showed a mean squared error of 14.33. These results were no surprise, 
especially as we had seen on synthetic data sets that GP using rigid OS and 
gender specific parent selection performs a lot better than standard GP 
([WAW04b], [Win04]). 

Comparing these results to those achieved using other methods, we saw that 
they were indeed promising, but still not completely satisfactory. In fact, we 
then started a series of data-based tests using GP in the context of the analysis 
of mechatronical systems; this encouraged us to intensify research on the use 
of extended GP concepts in the identification of mechatronical systems. 


11.1.2.2 Designing Virtual Sensors for Particulate Emissions (Soot) 


A lot of research work was done by DesCon members on the identification 
of particulate emissions of a BMW diesel engine. The main results have been 
published in [AdRWL05] and [LAWR05]; we shall here only summarize these 
results in a rather compact way. 

In short, first attempts to use GP for producing models for soot were not 
very successful; GP did not produce any useful solution without restriction of 
the search space. Therefore, a two step approach was used: 

“In a first step, a statistical analysis was done on the basis of steady state 
measurements. Expert knowledge was combined with statistical correlations 
to yield an accurate steady state model. The advantage of steady state anal- 
ysis is the secure validation of the model; any delay time or sensor dynamics 
are irrelevant. However, such a model could never meet the requirements of 
estimating the highly dynamical process of an IC engine. Therefore the steady 
state model is used as origin for the genetic programming cycle.” (Taken from 
[AdRWL05] where this static model is given in detail.) 

Using this static (steady state) model, an additional variable was calculated 
and inserted into the set of potential input variables; this enhanced set of 
variables was then used as the basis for data-based identification of soot. 

This extended data basis was used by two modeling approaches, namely a 
neural network training algorithm as well as GP; the best results for the ANN 
approach were achieved using a network structure with 2 hidden layers and 
25 hidden nodes per layer. The parameters of the GP-based training algo- 
rithm were set to our standard GP settings (1000 individuals, 10% mutation 
rate, rigid OS, 1-elitism). Again, the data were measured during a standard 
FTP engine test lasting approximately 23 minutes; the first approximately 8 
minutes were taken as training, the rest as validation / test data set. 

Figure 11.4 shows a detail of the evaluation of the models produced by GP 
and ANN on validation data: As we see clearly, neither virtual sensor 
captures the behavior completely correctly, but the GP model's fit seems to 
be better than that of the ANN model. This impression is confirmed by 
analyzing the distribution of errors, which is shown in Figure 11.5: The errors 
caused by the evaluation of the model produced by GP are more symmetric 
than those of the ANN,³ which can be considered an indication of a rather 
good model. The cumulative errors of these models are shown in Figure 11.6, 
and we here see that the model produced by GP is able to reproduce the 
engine's cumulated soot emissions quite well. 

Again, these results were far from completely satisfactory; of course, the 
ANN model could be improved by changing the network structure or the 
number of training iterations, and the GP process was not enhanced with 
local optimization or pruning operations. Still, these results sustained 
our confidence in GP's ability to produce reasonable models for mechatronical 
systems. 


3 In addition to GP and ANN, an auto-regressive moving-average with exogenous inputs 
(ARMAX) model was also calculated for comparison; the distribution of the errors 
caused by the evaluation of this model is also shown in Figure 11.5. 
Please see [BJ76] for explanations and application examples of ARMA(X) models. 
FIGURE 11.4: Evaluation of models for particulate matter emissions of a 
diesel engine (snapshot showing the evaluation of the model on validation / 
test samples), as given in [AdRWL05]. 

FIGURE 11.5: Errors distribution of models for particulate matter emissions, 
as given in [AdRWL05]. 


11.1.2.3 NOx Data Sets Used for Further Empirical Studies 


The NOx data set described previously in this section was used for several 
research activities of DesCon members as well as in our project investigating 
GP for the design of virtual sensors. Nevertheless, in the course of further 
research work several other measurements were recorded and analyzed; two 
of them were also used for test series that will be reported on in the following 
chapters. This is why we describe and characterize these data sets here. 


FIGURE 11.6: Cumulative errors of models for particulate matter emissions, 
as given in [AdRWL05] (axes: time [s] versus cumulative error [%]). 


NOx Data Set II 


Recorded in 2006 by members of the Institute for Design and Control of 
Mechatronical Systems at JKU Linz at the test bench already mentioned, 
this NOx data set includes the variables listed in Table 11.3. The data set 
available in this context again contains measurements taken from a 2 liter 4 
cylinder BMW diesel engine. Again, several emissions (including NOx, CO, 
and CO2) as well as several other engine parameters were recorded at 100 
Hz and downsampled to 20 Hz. 22 signals were recorded over approximately 
18 minutes, but only 9 variables were considered in further identification test 
series. 

Several variables were measured over approximately 30 minutes at 100 Hz 
recording frequency; they have been downsampled to 20 Hz, so that the 
resulting data set includes ~36,000 samples. From the variables recorded 
several have been removed (as for example CO, CO2, and time) due to 
irrelevance or high correlations with the target variable Nox_true; the 10 
remaining variables are characterized in Table 11.3. Figure 11.7 shows a 
graphical representation of the target values over the whole recording time. 

The variable NOx_Can represents values given by a quick, but also rather 
imprecise estimation for the NOx emissions; the actual NOx emissions were 
again measured using a Horiba Mexa 7000 combustion analyzer; the respective 
values are stored in variable Nox_true. 


Table 11.3: Statistical features of the identification relevant variables in the 
NOx data set II. 


Variable | Minimum | Maximum | Mean | Variance 
Eng nA E EEE 
AFSCD-mAirPerCyl | -44.56 | 1,161.36 | 453.12 | 60,952.03 
vsa D0 [5.00 ooj ssr L703 


NOxCAN C oao) iT] 
T_OEL | 78.68 100.83 87.57 31.05 


T) Noriruc [8246 | 1115.23 [235.25 | 60,673.98 
mOra PDs | owo) 10| os| om0 
InjCrv-qMTiDes__| 00) sa| pe) ar 
InjCro-phiMTIDes_| 386) 1061| 280| 1870 


BPSCD_pFltVal | 986.20 2,318.00 | 1214.89 | 104,434.00 




All pairwise linear correlations⁴ are summarized in Table 11.4; again, 
all variables were filtered using a median filter of order 5 before 
calculating the correlation coefficients. Obviously, there is a rather high 
linear correlation between the target variable and the input variables 
BPSCD_pFltVal and NOx_CAN; the values stored in AFSCD_mAirPerCyl and 
InjCrv_qMI1Des are also remarkably correlated with the designated target 
values. 


NOx Data Set III 


During the time in which we were doing the research work discussed here, 
maintenance work was repeatedly done at the DesCon test bench; amongst 
other aspects, several sensors were removed or replaced by newer ones. 


4 We here use the same standard formula for calculating linear correlation coefficients of 
time series as described in Section 9.4.1. 
FIGURE 11.7: Target NOx values of NOx data set II, recorded over 
approximately 30 minutes at 20 Hz recording frequency yielding ~36,000 samples. 


The third NOx data set was recorded in 2007 by members of DesCon; 
again, several variables were measured at the test bench while testing a 2 liter 
4 cylinder BMW diesel engine (simulated vehicle: BMW 320d Sedan). The 
mean engine speed was set to 2,200 revolutions per minute (rpm), and in each 
engine cycle 15 mg of fuel were injected. 

Once again, several emissions (including NOx, CO, and CO2) as well as 
several other engine parameters were recorded; this time the measurements were 
recorded over approximately 18.3 minutes at 100 Hz and then downsampled 
to 10 Hz, yielding a data set containing ~11,000 samples. The target values 
(the engine's NOx emissions measured by a Horiba combustion analyzer) are 
stored in variable HoribaNOx. 
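
The sample counts given for the NOx data sets follow directly from the recording time and the sampling rate after downsampling, as the following sketch illustrates. We assume here that downsampling simply keeps every k-th sample; how the measurements were actually reduced (for example by averaging) is not stated, and the function names are hypothetical.

```python
def downsample(samples, source_hz, target_hz):
    """Keep every k-th sample, with k = source_hz / target_hz.

    Assumption: plain decimation without any low-pass filtering or averaging.
    """
    if source_hz % target_hz != 0:
        raise ValueError("source rate must be a multiple of the target rate")
    step = source_hz // target_hz
    return samples[::step]


def expected_samples(minutes, rate_hz):
    """Number of samples recorded in `minutes` at `rate_hz`."""
    return round(minutes * 60 * rate_hz)


raw = list(range(1000))                    # 10 seconds recorded at 100 Hz
print(len(downsample(raw, 100, 10)))       # samples remaining at 10 Hz
print(expected_samples(18.3, 10))          # data set recorded over 18.3 minutes
print(expected_samples(30, 20))            # data set recorded over 30 minutes
```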

In [Win08], tests have been documented in which we used this data 
set for testing the ability of GP to incorporate physical knowledge. For this 
purpose we also used a synthetic variable HFM*: 


HFM _ 1000 


a 2oy 11. 
HFM x a (11.7) 
Table 11.4: Linear correlation coefficients of the variables relevant in the NOx 
data set II. 


(MOE HOG On On On On On On On me 
Olen tes [Loo [oso [os 70.70] 0.01] 0.05] 0.00 | 0.05 | 0.08 | 0.75 
D AFSCDmAtmPerog || 080_[ 100 [078 [0.90] 0.80] 0.91] 0.60 | 0-88 | 0.63 | 0.05 
BV SACD Out oss ore oo 0.73] 0.77] 0.78 | 0.58 | 0.77] 0.63 | 0.81 
3) NOz-CAN L070 [090 [07s | 100 | 074] 0.95 | 0.05 | 0.56] 0.58 | 0.94 
2) T-OEL | osr | oso | or [or [1.00] 0.78 | 0.51 | 0.75 | 0.49 | 0.81 
3) (T) NOwtrue | oss [001 [ors [0.93 [0.78 | 1.00] 0.61 | 0.90 | 0.60 | 0.95 
6) InjCrv-qPiliDes [0.59 [0.60 [0.38 [0.03 [0.51 | 0.01 | 1.00 [0.70 | 0.03 | 0.62 
7) InjCrv-aMTiDes | oss [0.88 [077 [0.86 [0.75 | 0.90 | 0.70 | 1.00 | 0.50 | 0.87 
8) InjCru-phiMTiDes | 0.08 | 0.03 | 0.03 | 0.58 | 0.49 | 0.00 | 0.03 | 0-50 | 1.00 | 0.66 
0) BPSODpFrival lors [095 | ost] oot] 081 | 0.95 | 0.61 | 0.87] 0.66} 1.00 


This synthetic variable is also included in NOx data set III; detailed 
explanations regarding the meaning of this additional variable can be found in 
[Win08]. 

Figure 11.8 visualizes all target HoribaNOx values available (in total 
approximately 11,000 samples); Figure 11.9 shows a detail of these data, namely 
the HoribaNOx values of samples 6000 to 7000. 


FIGURE 11.8: Target HoribaNOx values of NOx data set III. 


In detail, Table 11.5 summarizes the main statistical parameters of the 
variables relevant in this identification task. Again, all pairwise linear 
correlations have also been calculated, with the results summarized in Table 
11.6; all variables were again filtered using a median filter of order 5 before 
calculating the correlation coefficients. As we see in this table, there are no 
remarkably high correlations except for the obvious one between HFM and 
HFM*; the correlation coefficient of HFM* and the target, HoribaNOx, is above 
average (0.72), but not high enough to build a reasonable model using only 
this variable as input. 


FIGURE 11.9: Target HoribaNOx values of NOx data set III, samples 6000 
to 7000. 


In Sections 11.3, 11.4, and 11.6 we will present research results achieved 
using these NOx data sets II and III. 
Table 11.5: Statistical features of the variables in the NOx data set III. 


Variable | Minimum | Maximum | Mean | Variance 
OKORA oor] oom] oil] oom 
TMT C oono ao 15.282 | 16.990 
2) pMT C or son sea) 632% 
3) PI [0.000 20) oo) 0.627 
3) APT C oo| coo aes 38 
5) pRATL [487.900 [927.400 | 709.355 |13331040 
ODN [1,906.00 | 2.507.000 | 2.208.384 | 27,608.381 
7) pBOOST | _ 981.000 1906.000 | 1209.811 | 28.618.435 
3 ATM C oae ono oao 22 
9) HFM C oo er o oo 


Table 11.6: Linear correlation coefficients of the variables relevant in the NOx 
data set III. 


[Nos | oMi oMi | ai | trl | pRAIL| _N | pBOOSI | HFM | HFM™ 
Try Nos | 100 | 001] 015 | 015] 001] 005] 017] 0.5 EE 

0.0. ] 100] 00>] 007] 050 | 001] 00s | 007 | 0.70] 

[_0.15_]_0.03_|_1.00_] -0.03 [0.11 [0.01] 018 | 0.05] _-0.06_| 
11.2 Classification 
11.2.1 Introduction 


Classification is understood as the act of placing an object into a set of 
categories, based on the object’s properties. Objects are classified according 
to a (in most cases hierarchical) classification scheme also called taxonomy. 
Amongst many other possible applications, examples of taxonomic classifica- 
tion are biological classification (the act of categorizing and grouping living 
species of organisms), medical classification, and security classification (where 
it is often necessary to classify objects or persons for deciding whether a prob- 
lem might arise from the present situation or not). A statistical classification 
algorithm is supposed to take feature representations of objects and map them 
to a special, predefined classification label. Such classification algorithms are 
designed to learn (i.e., to approximate the behavior of) a function which maps 
a vector of object features into one of several classes; this is done by analyzing 
a set of input-output examples (“training samples”) of the function. Since 
statistical classification algorithms are supposed to “learn” such functions, we are 
dealing with a specific area of machine learning and, more generally, artificial 
intelligence. 

In a more formal way, the classification problem can be formulated as 
follows: Let the data consist of a set of samples, each containing k 
feature values $x_{i1}, \dots, x_{ik}$ and a class value $y_i$. What we look for is a function 
f that maps a sample $x_i$ to one of the c classes available: 


$f : X \to C$ (11.8) 
$\forall (x_i \in X) : f(x_i) = f(x_{i1}, \dots, x_{ik}) = y_i;\ y_i \in \{C_1, \dots, C_c\}$ (11.9) 


where X denotes the feature vector space and C the set of classes. 

There are several approaches which are nowadays used for solving data 
mining and, more specifically, classification problems. The most common ones 
are (as for example described in [Mit00]) decision tree learning, instance-based 
learning, inductive logic programming (as for example in Prolog), and 
reinforcement learning. 


11.2.2 Real-Valued Classification with Genetic Programming 


In this section we shall concentrate on GP-based classification. In fact, we 
will here restrict ourselves to real-valued classification tasks, i.e., 


$X \subset \mathbb{R}^k,\ C \subset \mathbb{R}$ (11.10) 


Thus, we can apply the GP-based system identification approach described in 
the previous sections; the representations of the problems (Section 9.2.2), the 
solution candidates (9.2.4), and the genetic operators (9.2.4.2) can be used 
without any restrictions. 

The only critical aspect is that the evaluation and the quality estimation of 
classifiers have to be modified: Evaluating a model m on a set of input features 
$(x_1, \dots, x_k)$ will lead to a target value $y \in \mathbb{R}$, but y does not necessarily have 
to be exactly one certain class value, i.e., we might get $y \notin C$. The exact 
mapping of feature vectors and their respective target values to class values 
is done using sets of thresholds $t_1, \dots, t_{c-1}$ placed between the class values 
$C_1, \dots, C_c$: 


$\forall (i \in [1; c-1]) : C_i < t_i < C_{i+1}$ (11.11) 


Based on a set of thresholds T we can classify a sample for which the target 
value y has been calculated as belonging to class $c_i$ using the mapping function 
$f'$: 


$f' : \{\mathbb{R}, \mathbb{R}^{c-1}\} \to C;\ c_i = f'(y, T)$ (11.12) 

$y < t_1 \Rightarrow f'(y, T) = C_1$ (11.13) 

$y > t_{c-1} \Rightarrow f'(y, T) = C_c$ (11.14) 

$\forall (i \in [1; c-2]) : t_i < y < t_{i+1} \Rightarrow f'(y, T) = C_{i+1}$ (11.15) 
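
The threshold-based mapping of a real-valued model output to a class value can be sketched as follows. This is a minimal illustration under two assumptions that are not part of the definition above: the thresholds are given in ascending order, and an output that is exactly equal to a threshold is assigned to the upper class (the definition leaves this case open). The function name is hypothetical.

```python
from bisect import bisect_right


def classify(y, thresholds, classes):
    """Map the model output y to one of the ordered class values.

    thresholds: ascending list t_1..t_(c-1), one threshold between each
                pair of neighboring class values
    classes:    ascending list C_1..C_c
    Assumption: y == t_i is assigned to the upper class C_(i+1).
    """
    if len(thresholds) != len(classes) - 1:
        raise ValueError("need exactly one threshold between neighboring classes")
    # bisect_right counts the thresholds that are <= y, which is the class index.
    return classes[bisect_right(thresholds, y)]


classes = [0, 1, 2]
thresholds = [0.5, 1.5]

for output in (-3.0, 0.2, 0.7, 1.9, 42.0):
    print(output, "->", classify(output, thresholds, classes))
```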


11.2.3 Analyzing Classifiers 
11.2.3.1 Classification Rates and Confusion Matrices 


When it comes to analyzing classifiers, the most important aspect is of 
course how many samples are classified correctly. For each feature vector 
sample x we have an original classification y, and by applying the classifier 
which is to be evaluated we get the predicted class y’. As described before, 
this classification of x is done using a classification model f and an optional 
post-processing step using thresholds T, yielding $y' = f'(f(x), T)$. 

Let us assume that we analyze n samples $x_{1 \dots n}$ (classified into c classes 
$C_1 \dots C_c$) with their respective original classifications $y_{1 \dots n}$; by applying a 
classification model m we get the respective predicted classifications $y'_{1 \dots n}$ as 
described above. The ratios of correctly classified samples for all classes or 
each class separately are calculated as cc and $cc_i$, respectively: 


$cc = \frac{|\{j : j \in [1;n] \wedge y_j = y'_j\}|}{n}$ (11.16) 

$cc_i = \frac{|\{j : j \in [1;n] \wedge y_j = y'_j \wedge y_j = C_i\}|}{|\{j : j \in [1;n] \wedge y_j = C_i\}|}$ (11.17) 

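
The two ratios can be computed directly from the lists of original and predicted classifications; the following lines are only a small sketch with hypothetical function names and example data.

```python
def correct_ratio(actual, predicted):
    """Ratio of correctly classified samples over all classes (cc)."""
    hits = sum(1 for y, y_pred in zip(actual, predicted) if y == y_pred)
    return hits / len(actual)


def correct_ratio_per_class(actual, predicted, class_value):
    """Ratio of correctly classified samples of one class (cc_i)."""
    members = [(y, y_pred) for y, y_pred in zip(actual, predicted)
               if y == class_value]
    if not members:
        raise ValueError("no sample of the requested class")
    hits = sum(1 for y, y_pred in members if y == y_pred)
    return hits / len(members)


actual = [0, 0, 1, 1, 1, 2]
predicted = [0, 1, 1, 1, 0, 2]

print(correct_ratio(actual, predicted))
for c in (0, 1, 2):
    print(c, correct_ratio_per_class(actual, predicted, c))
```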
For more detailed analysis, confusion matrices [KP98] contain information 
about actual and predicted classifications done by classification systems. In 
general, a confusion matrix cm is a table containing c x c cells that states 
how many samples of each given class are classified as belonging to a specific 
class; for example, each column of the matrix can represent the instances of a 
predicted class while each row represents the instances in the original (actual) 
class (or vice versa). So, the value $cm_{i,j}$ stores the number of samples of class 
i that are classified as class j. 

An example is given in Table 11.7 in which each row of the matrix represents 
the instances in a predicted class while each column represents the instances in 
the original (actual) class; additionally, the numbers of samples not classified 
($nc_1, \dots, nc_c$) are also given as well as the total rate of correct classifications. 
Please note that the sum of all cells has to be equal to the number of samples 
n, i.e., 


$\sum_{i=1}^{c} \sum_{j=1}^{c} cm_{i,j} + \sum_{i=1}^{c} nc_i = n$ (11.18) 
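
A confusion matrix including the counts of samples that were not classified can be built as sketched below. As an assumption made only for this illustration, a sample that was not classified is represented by the prediction `None`; the function name and the example data are hypothetical.

```python
def confusion_matrix(actual, predicted, classes):
    """Return (cm, nc): cm[i][j] counts samples of class i classified as
    class j, nc[i] counts samples of class i that were not classified.

    Assumption: an unclassified sample has the prediction None.
    """
    index = {c: k for k, c in enumerate(classes)}
    cm = [[0] * len(classes) for _ in classes]
    nc = [0] * len(classes)
    for y, y_pred in zip(actual, predicted):
        i = index[y]
        if y_pred is None:
            nc[i] += 1
        else:
            cm[i][index[y_pred]] += 1
    return cm, nc


actual = [1, 1, 2, 2, 3, 3, 3]
predicted = [1, 2, 2, None, 3, 3, 1]

cm, nc = confusion_matrix(actual, predicted, classes=[1, 2, 3])
total = sum(sum(row) for row in cm) + sum(nc)
print(cm, nc, total)   # total has to be equal to the number of samples
```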


Table 11.7: Exemplary confusion matrix with three classes 


                          Actual Class 
                          “1”        | “2”        | “3” 
Estimated Class   “1”   | $cm_{1,1}$ | $cm_{2,1}$ | $cm_{3,1}$ 
                  “2”   | $cm_{1,2}$ | $cm_{2,2}$ | $cm_{3,2}$ 
                  “3”   | $cm_{1,3}$ | $cm_{2,3}$ | $cm_{3,3}$ 
Not classified          | $nc_1$     | $nc_2$     | $nc_3$ 
Correct Classifications Ratio 


The special case of binary classification into two classes (i.e., c = 2) is 
frequently found as it is in many applications necessary to decide for given 
samples whether or not some given condition is fulfilled. There are four 
possible outcomes of a single predicted (estimated) classification in 
the case of binary classification into classes “positive” (“yes,” “1,” “true”) 
and “negative” (“no,” “0,” “false”): 


• A false positive classification is done when a sample is incorrectly 
classified as “positive” which is in fact “negative,” 


• a false negative classification is done when a sample is incorrectly 
classified as “negative” which is in fact “positive,” and 


• true positive as well as true negative classifications are respective correct 
classifications. 


A typical “positive / negative” example is given in Table 11.8: 
Table 11.8: Exemplary confusion matrix with two classes 

                             Actual Class 
                             Positive           | Negative 
Estimated Class   Positive | a (true positive)  | b (false positive) 
                  Negative | c (false negative) | d (true negative) 


In this case, with a, b, c, and d denoting the numbers of true positive, false 
positive, false negative, and true negative classifications, 


• the accuracy is defined as $ACC = \frac{a+d}{a+b+c+d}$, 


• the true positive rate (also called sensitivity) as $TP = \frac{a}{a+c}$, 
• the true negative rate (also called specificity) as $TN = \frac{d}{b+d}$, 


• the false positive rate as $FP = \frac{b}{b+d}$ (which is in fact the probability of 
classifying a sample as “positive” when it is actually “negative”), 


• the false negative rate as $FN = \frac{c}{a+c}$ (which is in fact the probability of 
classifying a sample as “negative” when it is actually “positive”), and 
finally 


• the precision as $P = \frac{a}{a+b}$. 
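
As a worked example, the following sketch evaluates these standard rates for the four cell counts of a two-class confusion matrix (a: true positives, b: false positives, c: false negatives, d: true negatives). The counts used are invented for illustration only, and the function name is hypothetical.

```python
def binary_rates(a, b, c, d):
    """Standard rates derived from a two-class confusion matrix.

    a: true positives, b: false positives,
    c: false negatives, d: true negatives.
    """
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "true_positive_rate": a / (a + c),    # sensitivity
        "true_negative_rate": d / (b + d),    # specificity
        "false_positive_rate": b / (b + d),
        "false_negative_rate": c / (a + c),
        "precision": a / (a + b),
    }


rates = binary_rates(a=40, b=10, c=5, d=45)   # invented example counts
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Note that the true positive and false negative rates add up to 1, as do the true negative and false positive rates.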


11.2.3.2 Receiver Operating Characteristic (ROC) Curves 


Receiver operating characteristic (ROC) analysis provides a convenient 
graphical display of the trade-off between true and false positive classifica- 
tion rates for two class (binary) problems [FE05]. Since its introduction in 
the medical and signal processing literatures ([HM82], [Zwe93]), ROC analysis 
has become a prominent method for selecting an operating point; for a recent 
snapshot of applications and methodologies see [FBF+03] and [HOFLe04]. 
ROC analysis often includes the calculation of the area under the ROC curve 
(AUC). 

In the context of two class classification, ROC curves are calculated in the 
following way: For each possible threshold value discriminating two given 
classes (e.g., 0 and 1, “true” and “false” or “positive” and “negative”), the 
numbers of true and false classifications for one of the classes are calculated. 
For example, if the two classes “true” and “false” are to be discriminated 
using a given classifier, a fixed set of equidistant thresholds is tested and the 
true positives (TP) and the false positives (FP) are counted for each of them. 
Each pair of TP and FP values produces a point of the ROC curve; examples 
are graphically shown in Figure 11.10. Slightly different versions are also often 
used; for example the positive predictive value (= TP / (TP + FP)) or the 
negative predictive value (= TN / (FN + TN)) could be displayed instead. 
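
The calculation of ROC points for a fixed set of equidistant thresholds can be sketched as follows. We assume here that a sample is classified as positive if the real-valued model output is greater than or equal to the threshold, that the thresholds span the range of the outputs, and that the counts are normalized to rates so that all points lie in the unit square; the function name and the data are hypothetical.

```python
def roc_points(scores, labels, n_thresholds=11):
    """Return a list of (false positive rate, true positive rate) pairs.

    scores: real-valued model outputs
    labels: True for samples that are actually positive
    Assumption: a sample is predicted positive if score >= threshold.
    """
    lo, hi = min(scores), max(scores)
    positives = sum(1 for label in labels if label)
    negatives = len(labels) - positives
    points = []
    for k in range(n_thresholds):
        threshold = lo + (hi - lo) * k / (n_thresholds - 1)
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and not l)
        points.append((fp / negatives, tp / positives))
    return points


scores = [0.0, 0.5, 0.25, 1.0]
labels = [False, False, True, True]

for fp_rate, tp_rate in roc_points(scores, labels):
    print(fp_rate, tp_rate)
```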
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FIGURE 11.10: Two exemplary ROC curves and their area under the ROC 
curve (AUC). 


The most common quantitative index describing an ROC curve is the area 
under it. The bigger the area under a ROC curve is, the better the discrim- 
inator model is; if the two classes can be ideally separated, the ROC curve 
goes through the upper left corner and, thus, the area under it reaches its 
maximal possible value which is exactly 1.0. 
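
One simple way of approximating the area under a ROC curve from a finite set of points is the trapezoidal rule, as sketched below. The use of the trapezoidal rule and the addition of the end points (0, 0) and (1, 1) are assumptions made for this illustration; the function name is hypothetical.

```python
def area_under_curve(points):
    """Trapezoidal approximation of the area under a ROC curve.

    points: iterable of (false positive rate, true positive rate) pairs
    Assumption: the curve starts in (0, 0) and ends in (1, 1).
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


print(area_under_curve([(0.0, 1.0)]))   # ideal separation: upper left corner
print(area_under_curve([(0.5, 0.5)]))   # point on the diagonal
```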


This method is very useful for analyzing the quality of two class classifiers, 
but unfortunately it is not directly applicable for more than two classes. When 
it comes to measuring or graphically illustrating the quality of multi-class 
classifiers, one possibility is to define symmetric areas around the original 
class values; for each class value $C_i$ the corresponding area is defined as 
$[C_i - r, C_i + r]$. Successively increasing the parameter value r from 0 to 
$\frac{C_{i+1} - C_i}{2}$ and calculating the numbers of correct and incorrect 
classifications for each r yields a set of pairs of FP/TP values. Jiang and 
Motai [JM05], for example, 
use this technique for illustrating and analyzing the classification performance 
in the context of automatic motion learning. 

Although this method can be used very easily, it is not generally applicable 
because it is restricted to symmetric areas. Fieldsend and Everson [FE05] 
propose a different approach and define the ROC surface for the Q-class 
problem in terms of a multi-objective optimization problem in which the goal 
is to simultaneously minimize misclassification rates when the misclassification 
costs and parameters governing the classifier's behavior are unknown. The 
problem with this approach is that the estimated Pareto fronts presented in 
[FE05] can be illustrated and used for graphical interpretation only for 
classification problems involving not more than three classes. This is why, in 
the following section, we propose the use of sets of ROC curves for each class 
separately. 
11.2.3.3 Sets of Receiver Operating Characteristic Curves and 
their Use in the Evaluation of Multi-Class Classification 


In this section we present an extension to ROC analysis making it possible 
to measure the quality of classifiers for multi-class problems. Unlike other 
multi-class-ROC approaches which have been presented recently (see [FE05] 
or [Sri99], e.g.) we propose a method based on the theory of ROC curves that 
creates sets of ROC curves for each class that can be analyzed separately or 
in combination. Thus, what we get is a convenient graphical display of the 
trade-off between true and false classifications for multi-class problems. We 
have developed a generalization of this AUC analysis for multi-class problems 
which gives the operator the possibility to see not only how accurately, but 
also how clearly classes can be separated from each other. 

The main idea presented here is that for each given class $C_i$ the numbers of 
true and false classifications are calculated for each possible pair of thresholds 
between the classes $C_{i-1}$ and $C_i$ as well as between $C_i$ and $C_{i+1}$. This is 
in fact done under the assumption that the $c$ classes are ordered and that 
$C_i < C_{i+1}$ holds for every $i \in [1, c-1]$ (with $c$ being the number of classes). 

For a given class $C_i$ the corresponding TP and FP values (on the basis of 
the $N$ original values $o$ and estimated values $e$) are calculated as: 

$$\forall \big(\langle t_a, t_b \rangle \mid (C_{i-1} < t_a < C_i) \wedge (C_i < t_b < C_{i+1})\big): \tag{11.19}$$
$$TP(t_a, t_b) = |\{e_j : (t_a < e_j < t_b) \wedge (t_a < o_j < t_b)\}| \tag{11.20}$$
$$FP(t_a, t_b) = |\{e_j : (t_a < e_j < t_b) \wedge (o_j < t_a \vee o_j > t_b)\}| \tag{11.21}$$


This approach has been published first in [WAW06d] and then described in 
detail (including application examples) in [WAW07]. 
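
As a hedged illustration of Equations (11.20) and (11.21), the following Python sketch counts TP and FP for one threshold pair and for a grid of pairs. The function names `mroc_point` and `mroc_points` and the sample data are assumptions made for this example only; they are not taken from HeuristicLab.

```python
from typing import List, Sequence, Tuple


def mroc_point(original: Sequence[float], estimated: Sequence[float],
               t_a: float, t_b: float) -> Tuple[int, int]:
    """Return the (FP, TP) counts for one threshold pair (t_a, t_b)."""
    tp = fp = 0
    for o, e in zip(original, estimated):
        if t_a < e < t_b:                 # sample is assigned to the class
            if t_a < o < t_b:
                tp += 1                   # original value lies in the band too
            elif o < t_a or o > t_b:
                fp += 1                   # original value lies outside the band
    return fp, tp


def mroc_points(original: Sequence[float], estimated: Sequence[float],
                lower: Sequence[float], upper: Sequence[float]
                ) -> List[Tuple[int, int]]:
    """(FP, TP) pairs for every combination of lower and upper thresholds."""
    return [mroc_point(original, estimated, t_a, t_b)
            for t_a in lower for t_b in upper]


if __name__ == "__main__":
    o = [1, 1, 2, 2, 3, 3]
    e = [1.1, 1.8, 2.0, 2.6, 2.4, 3.1]
    print(mroc_points(o, e, lower=[1.5], upper=[2.5, 2.7]))
```

With 10 lower and 10 upper thresholds this yields the 100 points mentioned in the text; dividing the counts by the numbers of samples outside and inside the class would give rates instead of counts.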

The resulting tuples of (FP, TP) values are stored in a matrix which can be 
plotted as illustrated in Figure 11.11: On the basis of synthetic 
data, $10^2 = 100$ ROC points for 10 thresholds between the chosen class $C_i$ and 
$C_{i-1}$ as well as between $C_i$ and $C_{i+1}$ were calculated. This yields 
a set of points which can be interpreted in analogy to the interpretation of 
“normal” ROC curves: The closer the points are located to the upper left 
corner, the higher is the quality of the classifier at hand. 

For getting sets of ROC curves instead of ROC points, the following change 
is introduced: An arbitrary threshold $t_a$ between the classes $C_{i-1}$ and $C_i$ is 
fixed and the FP and TP values for all possible thresholds $t_b$ between $C_i$ and 
$C_{i+1}$ are calculated. What we get is one single ROC curve; this calculation 
is executed for all possible values of $t_a$ (i.e., for all possible thresholds between 
$C_{i-1}$ and $C_i$). This procedure also has to be executed the other way around: 
An arbitrary threshold $t_b$ between $C_i$ and $C_{i+1}$ is fixed, the corresponding 
ROC points are calculated for all possible values of $t_a$, and this is repeated 
for all possible values of $t_b$. 

Finally, what we get is a set of ROC curves; an example showing 10 ROC 
curves is given in Figure 11.11. 

FIGURE 11.11: An exemplary graphical display of a multi-class ROC 
(MROC) matrix. 


Of course this procedure cannot be executed in exactly this way for the 
classes $C_1$ and $C_c$. For $C_1$ it is only possible to calculate the ROC points (and 
therefore the ROC curve) for all possible thresholds between $C_1$ and $C_2$; for 
$C_c$ this is done analogously with all possible thresholds between $C_{c-1}$ and $C_c$. 
This is why sets of ROC curves can be calculated for the classes $C_2 \ldots C_{c-1}$ 
whereas only simple ROC curves can be produced for $C_1$ and $C_c$. 

As already mentioned in the previous section, the area under the ROC 
curve (AUC) is a very common quantitative index describing the classifier’s 
quality. In the context of multi-class ROC (MROC) curves the two following 
values can be calculated assuming that all m ROC curves for a given class 
have already been calculated: 


• The maximum AUC (MaxAUC) is the maximum of all areas under the 
ROC curves calculated for a specific class. It measures how exactly this 
class is separated from the others using the best threshold parameter 
setting. 

$$MaxAUC = \max_{i=1 \ldots m} AUC(ROC_i)$$

• The average AUC (AvgAUC) is calculated as the mean value of all areas 
under the ROC curves for a specific class. It measures how clearly this 
class is separated from the others since it takes into account all possible 
threshold parameter settings. 

$$AvgAUC = \frac{\sum_{i=1 \ldots m} AUC(ROC_i)}{m}$$
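
The two indices can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: each ROC curve is given as a list of (false positive rate, true positive rate) points in [0, 1], and the area is approximated with the trapezoidal rule; the names `auc` and `max_and_avg_auc` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (false positive rate, true positive rate)


def auc(curve: Sequence[Point]) -> float:
    """Area under one ROC curve, approximated by the trapezoidal rule."""
    pts = sorted(curve)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def max_and_avg_auc(curves: List[Sequence[Point]]) -> Tuple[float, float]:
    """Return (MaxAUC, AvgAUC) over all m ROC curves of one class."""
    areas = [auc(c) for c in curves]
    return max(areas), sum(areas) / len(areas)


if __name__ == "__main__":
    perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    diagonal = [(0.0, 0.0), (1.0, 1.0)]
    print(max_and_avg_auc([perfect, diagonal]))
```

A large gap between the two returned values indicates that the class is separated well only for a few threshold settings.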

We will in the following turn to a topic very much related to what we have 
discussed in the previous sections, namely the evaluation of classifiers evolved 
by GP. 


11.2.4 Classification Specific Evaluation in GP 


Of course, there is on the one hand no reason why standard evaluation 
functions such as the MSE / MEE, VAF, or $R^2$ functions could not be used 
for estimating the quality of classification models during the GP process. The 
reason for this is that we here want the identification algorithm to produce 
a model that is able to reproduce the given target data as well as possible, 
similar to when dealing with regression or time series analysis. 

Still, on the other hand the evaluation of classification models may also 
include several aspects for which the standard evaluation functions are not 
suitable. This is why we shall describe several aspects that may contribute to 
a classification specific evaluation function for GP solution candidates in the 
context of real-valued learning of classifiers with genetic programming. 


11.2.4.1 Preprocessing of Estimated Target Values 


Before we compare original and estimated class values we suggest the fol- 
lowing classification specific preprocessing step: 

The errors of predicted values that are lower than the lowest class value or 
greater than the greatest class value should not have a quadratic or even worse, 
but rather a partially only linear contribution to the fitness of a model. To be 
a bit more precise: Given $n$ samples with original classifications $o_i$ divided 
into $c$ classes $C_1, \ldots, C_c$ (with $C_1$ being the lowest and $C_c$ the greatest class 
value), the preprocessed estimated values $preproc(e_i)$ shall be calculated 
as follows: 

$$\forall (i \in [1, n]):$$
$$(e_i < C_1) \Rightarrow preproc(e_i, x) = C_1 - (C_1 - e_i)^{1/x} \tag{11.22}$$
$$(e_i > C_c) \Rightarrow preproc(e_i, x) = C_c + (e_i - C_c)^{1/x} \tag{11.23}$$

with $x$ being an exponential parameter which depends on the evaluation function 
that uses these preprocessed values. For example, when using the mean 
squared error or any other function that incorporates the use of squared differences 
between original and estimated value, $x$ is to be set to 2, whereas 
when using the MEE function it has to be set to the chosen exponent. 

The reason for this is that values that are greater than the greatest class 
value or below the lowest value are anyway classified as belonging to the 
class having the greatest or the lowest class number, respectively; using a 
standard evaluation function without preprocessing of the estimated values 
would punish a formula producing such values more than necessary. 
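
A minimal Python sketch of this preprocessing step follows. It assumes that the exponent in Equations (11.22) and (11.23) is 1/x, so that an evaluation function raising the error to the power x ends up with a linear contribution; the function name `preproc` mirrors the text, while the parameter names are hypothetical.

```python
def preproc(e: float, c_min: float, c_max: float, x: float = 2.0) -> float:
    """Dampen estimated values outside the class value range [c_min, c_max].

    Assumption: the distance to the nearest class bound is raised to 1/x,
    so an x-th power error function grows only linearly out there.
    """
    if e < c_min:
        return c_min - (c_min - e) ** (1.0 / x)
    if e > c_max:
        return c_max + (e - c_max) ** (1.0 / x)
    return e  # values inside the range are left untouched


if __name__ == "__main__":
    # A sample of class 1 estimated as -3: the raw squared error is 16,
    # the squared error after preprocessing is (1 - (-1))^2 = 4.
    print(preproc(-3.0, 1.0, 3.0), preproc(7.0, 1.0, 3.0), preproc(2.0, 1.0, 3.0))
```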

11.2.4.2 Considering Standard Evaluation Functions 


For quantifying the quality of classifiers we can use all functions described 
in Section 9.2.5; in contrast to standard applications, we can also apply these 
functions for each class individually. 

In the standard case, all $n$ values are evaluated using the MEE, VAF, 
and $R^2$ values as well as the minimum and maximum errors $error_{min}$ and 
$error_{max}$; these can optionally be calculated using the preprocessed values 
$preproc(e_i)$ instead of $e_i$ for all $i \in [1; n]$. Thus, we get partial values $mee$, 
$vaf$, $r2$, $error_{min}$, and $error_{max}$ which can be weighted using the factors 
$w_{mee}$, $w_{vaf}$, $w_{r2}$, $w_{errmin}$, and $w_{errmax}$. 

This approach of course does not consider the distribution of samples to 
the classes; for example, if 98% of the samples belong to class 0 and only 2% 
to class 1, then the evaluation of a model classifying all samples as 0 will be 
fairly good when using these standard evaluation functions even though this 
classifier is more or less useless. 

In order to overcome this problem we could for example sample the data 
so that all classes are represented by the same number of samples; we instead 
here describe the application of these evaluation functions to the classes given 
separately: 

The sets of estimated values $ec_1 \ldots ec_c$ contain the values estimated for 
each class $C_1 \ldots C_c$, and in analogy to this the sets $oc_1 \ldots oc_c$ are sets of the 
corresponding class values: 

$$\forall (i \in [1; n]) : o_i = C_k \Rightarrow e_i \in ec_k,\; o_i \in oc_k \tag{11.24}$$

Additionally, we also need class weights $w_1 \ldots w_c$ (with $w = \sum_i w_i$) and 
can so calculate the partial fitness values as 

$$mee = \frac{1}{w} \sum_{i=1}^{c} mee(oc_i, ec_i, n) \cdot w_i \tag{11.25}$$
$$r2 = \frac{1}{w} \sum_{i=1}^{c} r^2(oc_i, ec_i) \cdot w_i \tag{11.26}$$
$$vaf = \frac{1}{w} \sum_{i=1}^{c} \left(1 - \frac{var(oc_i - ec_i)}{var(o)}\right) \cdot w_i \tag{11.27}$$
$$error_{min} = \frac{1}{w} \sum_{i=1}^{c} error_{min}(oc_i, ec_i) \cdot w_i \tag{11.28}$$
$$error_{max} = \frac{1}{w} \sum_{i=1}^{c} error_{max}(oc_i, ec_i) \cdot w_i \tag{11.29}$$

Again, these values can optionally be calculated using the preprocessed values 
preproc(e;) instead of e; for all i € [1;n]. 

Of course, the adjusted functions described in Section 9.2.5.3 could be used 
instead of the standard functions. 
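
To make the effect of the class-wise evaluation concrete, the following sketch applies the weighting scheme to the mean squared error, used here as a stand-in for the MEE with exponent 2. The function names are hypothetical, and the data reproduce the 98% / 2% situation described above.

```python
from collections import defaultdict
from typing import Dict, Sequence


def plain_mse(original: Sequence[float], estimated: Sequence[float]) -> float:
    """Standard mean squared error over all samples."""
    return sum((o - e) ** 2 for o, e in zip(original, estimated)) / len(original)


def classwise_mse(original: Sequence[float], estimated: Sequence[float],
                  weights: Dict[float, float]) -> float:
    """Weighted mean of the per-class mean squared errors."""
    errors = defaultdict(list)
    for o, e in zip(original, estimated):
        errors[o].append((o - e) ** 2)    # group squared errors by original class
    w = sum(weights[c] for c in errors)   # w = sum of the class weights
    return sum(weights[c] * sum(errs) / len(errs)
               for c, errs in errors.items()) / w


if __name__ == "__main__":
    o = [0] * 98 + [1] * 2
    e = [0.0] * 100                       # a model that always answers "0"
    print(plain_mse(o, e), classwise_mse(o, e, {0: 1.0, 1: 1.0}))
```

The trivial model looks good under the plain error but receives a clearly worse value once each class contributes with equal weight.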

11.2.4.3 Considering Classification Specific Aspects 


We propose the consideration of the following classification specific aspects 
in the evaluation of classifier models: 


• The range of the values estimated for each of the given classes, 

• how well the classes are separated from each other depending 
on the choice of appropriate thresholds, and 

• the area under ROC curves or, in the case of multi-class classification, 
the area under sets of MROC curves. 


Class Ranges 


For calculating the class ranges $cr_1 \ldots cr_c$ we need the sets of 
estimated values for each class, $ec_1 \ldots ec_c$: 

$$\forall (i \in [1; c]) : cr_i = \max(ec_i) - \min(ec_i) \tag{11.30}$$

and can so calculate the class ranges’ contribution $cr$ as 

$$cr = \sum_{i=1}^{c} cr_i \cdot w_i \tag{11.31}$$


Figure 11.12 exemplarily displays several samples with original class values 
C1, C2, and C3; the class ranges result from the estimated values for each 
class and are indicated as cr1, cr2, and cr3. 
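
The class ranges and their weighted contribution can be sketched as follows. The function names and the example values are assumptions for illustration; the weighted sum follows Equation (11.31) without any further normalization.

```python
from collections import defaultdict
from typing import Dict, Sequence


def class_ranges(original: Sequence[float],
                 estimated: Sequence[float]) -> Dict[float, float]:
    """cr_i = max(ec_i) - min(ec_i) for every original class value."""
    groups = defaultdict(list)
    for o, e in zip(original, estimated):
        groups[o].append(e)
    return {c: max(values) - min(values) for c, values in groups.items()}


def class_range_contribution(ranges: Dict[float, float],
                             weights: Dict[float, float]) -> float:
    """Weighted sum of all class ranges."""
    return sum(ranges[c] * weights[c] for c in ranges)


if __name__ == "__main__":
    o = [1, 1, 2, 2]
    e = [0.75, 1.25, 2.0, 2.5]
    ranges = class_ranges(o, e)
    print(ranges, class_range_contribution(ranges, {1: 1.0, 2: 2.0}))
```

Narrow class ranges mean that the values estimated for one class lie close together, which makes the subsequent choice of thresholds easier.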


Thresholds Analysis 


As is indicated in Figure 11.12 we do not only want to consider class ranges 
but also a more classification-like approach. Between each pair of contiguous 
classes we set $m$ equally distributed temporary thresholds: 

$$\forall (i \in [1; c-1])\, \forall (k \in [1; m]) : t_{i,k} = C_i + k \cdot \frac{C_{i+1} - C_i}{m + 1} \tag{11.32}$$

Then, for each threshold we count the numbers of samples which are classified 
incorrectly; here we also consider a given matrix $mcp$ storing misclassification 
punishments for each pair of classes, giving the misclassification punishment 
for classifying a sample of class $a$ as class $b$ as $mcp_{a,b}$ for all $a$ and $b$ in $[1; c]$: 

$$\forall (i \in [1; c-1])\, \forall (k \in [1; m])\, \forall (j \in [1; n]) :$$
$$p(i, k, j) = \begin{cases} mcp_{i,i+1} \cdot \frac{1}{freq_i} & : o_j < t_{i,k} \wedge e_j > t_{i,k} \\ mcp_{i+1,i} \cdot \frac{1}{freq_{i+1}} & : o_j > t_{i,k} \wedge e_j < t_{i,k} \\ 0 & : \text{else} \end{cases} \tag{11.33}$$
$$p(i, k) = \sum_{j} p(i, k, j) \tag{11.34}$$
assuming that a sample $j$ is (temporarily) classified as class $(i+1)$ if $e_j > t_{i,k}$ 
and as class $i$ if $e_j < t_{i,k}$; $freq_a$ is the frequency of class $a$, i.e., the number 
of samples that are originally classified as belonging to class $a$. 


FIGURE 11.12: Classification example: Several samples with original class 
values C1, C2, and C3 are shown; the class ranges result from the estimated 
values for each class and are indicated as cr1, cr2, and cr3. (Legend: original class values, estimated values.) 
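
The punishment sums of Equations (11.32) to (11.34) can be sketched in Python as follows. Two assumptions are made: the m thresholds are placed with the divisor m + 1, so that they lie strictly between the two class values, and each punishment is normalized by the frequency of the class on the respective side of the threshold. All names are hypothetical, and every class is assumed to contain at least one sample.

```python
from typing import List, Sequence


def thresholds_between(c_lo: float, c_hi: float, m: int) -> List[float]:
    """m equally spaced thresholds strictly between two class values."""
    return [c_lo + k * (c_hi - c_lo) / (m + 1) for k in range(1, m + 1)]


def punishment(original: Sequence[float], estimated: Sequence[float],
               c_lo: float, c_hi: float, t: float,
               mcp_lo_hi: float = 1.0, mcp_hi_lo: float = 1.0) -> float:
    """Sum of punishments p(i, k) for one threshold t between c_lo and c_hi."""
    freq_lo = sum(1 for o in original if o == c_lo)
    freq_hi = sum(1 for o in original if o == c_hi)
    total = 0.0
    for o, e in zip(original, estimated):
        if o < t and e > t:               # lower class estimated above t
            total += mcp_lo_hi / freq_lo
        elif o > t and e < t:             # upper class estimated below t
            total += mcp_hi_lo / freq_hi
    return total


if __name__ == "__main__":
    o = [1, 1, 2, 2]
    e = [0.9, 1.6, 1.4, 2.1]
    ts = thresholds_between(1, 2, 3)
    print(ts, [punishment(o, e, 1, 2, t) for t in ts])
```

Taking the minimum of the printed punishments corresponds to the first of the two aggregation variants that the text introduces next.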


The thresholds’ contribution to the classifier’s fitness, $thresh$, can now be 
calculated in two different ways: We can consider the minimum sum of punishments 
for each pair of contiguous classes as 

$$thresh = \sum_{i=1}^{c-1} \min_{k \in [1; m]} p(i, k) \tag{11.35}$$

or consider all thresholds, weighted using threshold weights $tw_{1 \ldots m}$, 
as 

$$thresh = \sum_{i=1}^{c-1} \frac{1}{m} \sum_{k=1}^{m} p(i, k) \cdot tw_k \tag{11.36}$$


Normally, we define the threshold weights $tw$ using minimum and maximum 
weights, weighting the thresholds near to the original class values minimally 
and those in the “middle” maximally: 

$$tw_1 = tw_{min}, \quad tw_m = tw_{min}; \quad tw_{range} = tw_{max} - tw_{min} \tag{11.37}$$

$$m \bmod 2 = 0 \Rightarrow \begin{cases} l = m/2 \\ tw_l = tw_{l+1} = tw_{max} \\ \forall (i \in [2; l-1]) : tw_i = tw_{min} + \frac{tw_{range}}{l-1} \cdot (i-1) \\ \forall (i \in [l+1; m-1]) : tw_{m-i+1} = tw_i \end{cases} \tag{11.38}$$

$$m \bmod 2 = 1 \Rightarrow \begin{cases} l = (m+1)/2 \\ tw_l = tw_{max} \\ \forall (i \in [2; l-1]) : tw_i = tw_{min} + \frac{tw_{range}}{l-1} \cdot (i-1) \\ \forall (i \in [l+1; m-1]) : tw_{m-i+1} = tw_i \end{cases} \tag{11.39}$$
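
The weighting scheme can be sketched as a small Python function. It assumes a linear rise from the minimum weight at the outermost thresholds to the maximum weight in the middle, mirrored on both sides; the function name `threshold_weights` and the restriction to at least three thresholds are assumptions of this sketch.

```python
from typing import List


def threshold_weights(m: int, tw_min: float, tw_max: float) -> List[float]:
    """Symmetric weights: tw_min at both ends, tw_max in the middle."""
    if m < 3:
        raise ValueError("this sketch needs at least three thresholds")
    l = m // 2 if m % 2 == 0 else (m + 1) // 2
    step = (tw_max - tw_min) / (l - 1)
    weights = [0.0] * m
    for i in range(1, l + 1):             # 1-based index as in the text
        value = tw_min + step * (i - 1)
        weights[i - 1] = value            # tw_i
        weights[m - i] = value            # tw_{m-i+1}, the mirrored position
    return weights


if __name__ == "__main__":
    print(threshold_weights(5, 1.0, 3.0))
    print(threshold_weights(6, 1.0, 3.0))
```

For an even number of thresholds the two middle positions both receive the maximum weight, for an odd number only the single middle position does.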


(M)ROC Analysis 


Finally, we also consider the area under the (M)ROC curves as described in 
Section 11.2.3.3: For each class we calculate the AUC values for ROC curves 
and sets of MROC curves (with a given number of thresholds checked for each 
class), and then we can either use the average AUC or the maximum AUC 
for each class weighted with the weighting factors already mentioned before: 


$$auc = \begin{cases} \sum_{i=1}^{c} AvgAUC(C_i) \cdot w_i & : \text{consider average AUCs} \\ \sum_{i=1}^{c} MaxAUC(C_i) \cdot w_i & : \text{consider maximum AUCs} \end{cases} \tag{11.40}$$


11.2.4.4 Combined Classifier Evaluation 


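As we have now compiled all information needed for estimating the quality 
of a classifier model in GP, $CLASS$, we calculate the final overall quality 
using respective weighting factors: 

$$\begin{aligned}
a_1 &= mee \cdot c_1 & (c_1 &= w_{mee}) \\
a_2 &= vaf \cdot c_2 & (c_2 &= w_{vaf}) \\
a_3 &= r2 \cdot c_3 & (c_3 &= w_{r2}) \\
a_4 &= error_{min} \cdot c_4 & (c_4 &= w_{errmin}) \\
a_5 &= error_{max} \cdot c_5 & (c_5 &= w_{errmax}) \\
a_6 &= cr \cdot c_6 & (c_6 &= w_{cr}) \\
a_7 &= thresh \cdot c_7 & (c_7 &= w_{thresh}) \\
a_8 &= auc \cdot c_8 & (c_8 &= w_{auc})
\end{aligned}$$

$$CLASS(o, e) = \frac{\sum_{i=1}^{8} a_i}{\sum_{i=1}^{8} c_i} \tag{11.41}$$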
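
The combination in Equation (11.41) is a weighted mean of the partial values. The sketch below shows this with a hypothetical helper `combined_quality`; the dictionary keys and numbers are illustrative assumptions and do not represent results from the test series.

```python
from typing import Dict


def combined_quality(partials: Dict[str, float],
                     weights: Dict[str, float]) -> float:
    """Sum of weighted partial values divided by the sum of the weights."""
    weighted_sum = sum(partials[name] * weights[name] for name in partials)
    weight_total = sum(weights[name] for name in partials)
    return weighted_sum / weight_total


if __name__ == "__main__":
    partials = {"cr": 0.5, "thresh": 1.0, "auc": 0.25}
    weights = {"cr": 1.0, "thresh": 1.0, "auc": 2.0}
    print(combined_quality(partials, weights))
```

Setting a weight to 0 removes the respective aspect from the evaluation, so the same function covers pure error-based fitness as well as the fully combined variant.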

11.2.5 Application Example: Medical Data Analysis 

11.2.5.1 Benchmark Data Sets 


For testing GP-based training of classifiers here we have picked the following 
data sets: The Wisconsin Breast Cancer, the Melanoma, and the Thyroid data 
sets. 


• The Wisconsin data set is a part of the UCI machine learning repository⁵. 
In short, it represents medical measurements which were recorded 
while investigating patients potentially suffering from breast cancer. 
The number of features recorded is 9 (all being continuous numeric 
ones); the file version we have used contains 683 recorded examples (by 
now, 699 examples are already available since the database is updated 
regularly). 


• The Thyroid data set represents medical measurements which were 
recorded while investigating patients potentially suffering from hypo- 
or hyperthyroidism; this data set has also been taken from the UCI 
repository. In short, the task is to determine whether a patient is hypothyroid 
or not. Three classes are formed: Euthyroid (the state of 
having normal thyroid gland function), hyperthyroid (overactive thyroid), 
and hypothyroid (underactive thyroid). 
In total, the data set contains 7200 samples. The samples of the Thyroid 
data set are not equally distributed to the three given classes; in fact, 
166 samples belong to class “1” (“subnormal functioning”), 368 samples 
are classified as “2” (“hyperfunction”), and the remaining 6666 samples 
belong to class “3” (“euthyroid”); a good classifier therefore has to be 
able to correctly classify significantly more than 92% of the samples simply 
because 92 percent of the patients are not hypo- or hyperthyroid. 
21 attributes (15 binary and 6 continuous ones) are stored in this data 
set. 


• The Melanoma data set represents medical measurements which were 
recorded while investigating patients potentially suffering from skin cancer. 
It contains 1311 examples for which 30 features have been recorded; 
each of the 1311 samples represents a pigmented skin lesion which has to 
be classified as a melanoma or a nonhazardous nevus. This data set has 
been provided to us by Prof. Dr. Michael Binder from the Department 
of Dermatology at the Medical University Vienna, Austria. 

A comparison of machine learning methods for the diagnosis of pigmented 
skin lesions (i.e., detecting skin cancer based on the analysis 
of visual data) can be found in [DOMK+01]; in this paper the authors 
describe the quality of classifiers produced for a comparable data 
collection using k-NN classification, ANNs, decision trees, and SVMs. 
The difference is that in the data collection used in [DOMK+01] all lesions 
were separated into three classes (common nevi, dysplastic nevi, 
or melanoma); here we use data representing lesions that have been 
classified as benign or malign, i.e., we are facing a binary classification 
problem. 


⁵ http://www.ics.uci.edu/~mlearn/. 


Table 11.9: Set of function and terminal definitions for enhanced GP-based 
classification. 

Functions (description) 
  Addition 
  Multiplication 
  Subtraction 
  Division 
  Exponential Function 
  If [Arg0] then return [Then] branch ([Arg1]), otherwise return [Else] branch ([Arg2]) 
  Less or equal, greater or equal 
  Logical AND, logical OR 

Terminals 
  Name       Description 
  var x, c   Value of attribute x multiplied with coefficient c 
  const      A constant double value d 


All three data sets were investigated via 10-fold cross-validation. This 
means that each original data set was divided into 10 disjoint sets of (ap- 
proximately) equal size. Thus, 10 different pairs of training (90% of the data) 
and test data sets (10% of the data) can be formed and used for testing the 
classification algorithm. 
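
The partitioning described above can be sketched as follows. This is a minimal illustration with a hypothetical function `k_fold_partitions`; shuffling with a fixed seed is an assumption of the sketch, since the text does not state how the samples were assigned to the 10 sets.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_partitions(n_samples: int, k: int = 10,
                      seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield k (training indices, test indices) pairs with disjoint test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k folds of nearly equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    # 683 samples as in the Wisconsin data set
    for train, test in k_fold_partitions(683):
        print(len(train), len(test))
```

Each sample appears in exactly one test set, so the averaged test results cover the whole data set once.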


11.2.5.2 Solution Candidate Representation Using Hybrid Tree 
Structures 


The selection of the functions library is an important part of any GP mod- 
eling process because this library should be able to represent a wide range of 
systems; Table 11.9 gives an overview of the function set as well as the ter- 
minal nodes used for the classification experiments documented here. As we 
can see in Table 11.9, mathematical functions and terminal nodes are used as 
well as Boolean operators for building complex arithmetic expressions. Thus, 
the concept of decision trees is included in this approach together with the 
standard structure identification concept that tries to evolve nonlinear mathematical 
expressions. An example showing the structure tree representation 
of a combined formula including arithmetic as well as logical functions is displayed 
in Figure 11.13. 


FIGURE 11.13: An exemplary hybrid structure tree of a combined formula 
including arithmetic as well as logical functions. 


11.2.5.3 Evaluation of Classification Models 


There are several possible functions that can serve as fitness functions 
within the GP process. For example, the ratio of misclassifications (using op- 
timal thresholds) or the area under the corresponding ROC curves ([Zwe93], 
[Bra97]) could be used. Another function frequently used for quantifying the 
quality of models is the $R^2$ function that takes into account the sum of squared 
errors as well as the sum of squared target values; an alternative, the so-called 
adjusted $R^2$ function, is also utilized in many applications. 

We have decided to use a variant of the squared errors function for estimating 
the quality of a classification model. There is one major difference between this 
modified mean squared errors function and the standard implementation of this 
function: The errors of predicted values that are lower than the lowest class 
value or greater than the greatest class value do not have a totally quadratic, 
but partially only linear contribution to the fitness value. To be a bit more 
precise: Given $N$ samples with original classifications $o_i$ divided into $n$ classes 
$c_1, \ldots, c_n$ (with $c_1$ being the lowest and $c_n$ the greatest class value), the fitness 
value $F$ of a classification model producing the estimated classification values 
$e_i$ is evaluated as follows: 

$$\forall (i \in [1, N]) : \begin{cases} (e_i < c_1) \Rightarrow f_i = (o_i - c_1)^2 + |c_1 - e_i|, \\ (c_1 \le e_i \le c_n) \Rightarrow f_i = (e_i - o_i)^2, \\ (e_i > c_n) \Rightarrow f_i = (o_i - c_n)^2 + |c_n - e_i| \end{cases} \tag{11.42}$$

$$F = \frac{1}{N} \sum_{i=1}^{N} f_i \tag{11.43}$$


The reason for this is that values that are greater than the greatest class 
value or below the lowest value are anyway classified as belonging to the 
class having the greatest or the lowest class number, respectively; using a 
standard implementation of the squared error function would punish a formula 
producing such values more than necessary. 
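
The modified squared error of Equations (11.42) and (11.43) can be sketched in Python as follows. The function names are hypothetical; the sketch assumes that the quadratic part is measured up to the nearest class bound and that the remaining distance outside the class value range is added linearly.

```python
from typing import Sequence


def modified_squared_error(o: float, e: float,
                           c_min: float, c_max: float) -> float:
    """Error f_i of one sample; linear outside the range [c_min, c_max]."""
    if e < c_min:
        return (o - c_min) ** 2 + abs(c_min - e)
    if e > c_max:
        return (o - c_max) ** 2 + abs(c_max - e)
    return (e - o) ** 2


def fitness(original: Sequence[float], estimated: Sequence[float],
            c_min: float, c_max: float) -> float:
    """Mean of the modified squared errors over all N samples."""
    total = sum(modified_squared_error(o, e, c_min, c_max)
                for o, e in zip(original, estimated))
    return total / len(original)


if __name__ == "__main__":
    # Class values 1, 2, 3; two estimates fall outside the range [1, 3].
    print(fitness([1, 3, 2], [-2.0, 5.0, 2.5], 1, 3))
```

A sample of class 1 estimated as -2 contributes 3 instead of the 9 a standard squared error would assign.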


11.2.5.4 Finding Appropriate Class Thresholds: Dynamic Range 
Selection 


Of course, a mathematical expression alone does not yet define a classifica- 
tion model; thresholds are used for dividing the output into multiple ranges, 
each corresponding to exactly one class. These regions are defined before 
starting the training algorithm in static range selection (SRS, see for example 
[LC05] for explanations), which brings along the difficulty of determining the 
appropriate range boundaries a priori. In the GP-based classification frame- 
work discussed here we have therefore used dynamic range selection (DRS) 
which attempts to overcome this problem by evolving the range thresholds 
along with the classification models: Thresholds are chosen so that the sum 
of class-wise ratios of misclassifications for all given classes is minimized (on 
the training data, of course). 

In detail, let us consider the following: Given $N$ (training) samples with 
original classifications $o_i$ divided into $n$ classes $c_1, \ldots, c_n$ (with $c_1$ being the 
lowest and $c_n$ the greatest class value), models produced by GP can in 
general be used for calculating estimated values $e_i$ for all $N$ samples. Assuming 
thresholds $T = t_1, \ldots, t_{n-1}$ (with $c_j < t_j < c_{j+1}$ for $j \in [1; n-1]$), each 
sample $k$ is classified as $ec_k$: 

$$e_k < t_1 \Rightarrow ec_k(T) = c_1 \tag{11.44}$$
$$t_j < e_k < t_{j+1} \Rightarrow ec_k(T) = c_{j+1} \tag{11.45}$$
$$e_k > t_{n-1} \Rightarrow ec_k(T) = c_n \tag{11.46}$$

Thus, assuming a set of thresholds $T_m$, for each class $c_k$ we get the ratio of 
correctly classified samples $cr_k$ as 

$$total_k(T_m) = |\{a : (\forall (x \in a) : o_x = c_k)\}| \tag{11.47}$$
$$correct_k(T_m) = |\{b : (\forall (x \in b) : o_x = ec_x(T_m) \wedge o_x = c_k)\}| \tag{11.48}$$
$$cr_k(T_m) = \frac{correct_k(T_m)}{total_k(T_m)} \tag{11.49}$$


The sum of ratios of correctly classified samples is, dependent on the set of 
thresholds $T_m$, calculated as 

$$cr(T_m) = \sum_{i=1}^{n} cr_i(T_m) \tag{11.50}$$

So, finally we can define the set of thresholds applied as that set $T_{opt}$ for 
which every other set of thresholds leads to equal or lower sums of classification 
accuracies⁶: 

$$T_a \neq T_{opt} \Rightarrow cr(T_a) \le cr(T_{opt}) \tag{11.51}$$


These thresholds, that are optimal for the training samples, are fixed and also 
applied on the test samples. 
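
The selection of thresholds on the training data can be sketched as follows. The text only states the criterion (maximal sum of class-wise accuracies); the exhaustive grid search over equally spaced candidate thresholds used here is an assumption of this sketch, as are all function names. Estimated values that coincide exactly with a threshold are assigned to the upper class, a case the equations leave open.

```python
from itertools import product
from typing import Sequence, Tuple


def classify(e: float, thresholds: Sequence[float],
             classes: Sequence[float]) -> float:
    """Map an estimated value to a class using ascending thresholds."""
    for t, c in zip(thresholds, classes):
        if e < t:
            return c
    return classes[-1]


def sum_of_classwise_accuracies(original, estimated, thresholds, classes) -> float:
    """cr(T): sum over all classes of the ratio of correctly classified samples."""
    total = 0.0
    for c in classes:
        members = [e for o, e in zip(original, estimated) if o == c]
        correct = sum(1 for e in members if classify(e, thresholds, classes) == c)
        total += correct / len(members)   # assumes every class has samples
    return total


def dynamic_range_selection(original, estimated, classes,
                            steps: int = 20) -> Tuple[float, ...]:
    """Grid search for thresholds maximizing the sum of class-wise accuracies."""
    candidates = [[lo + k * (hi - lo) / (steps + 1) for k in range(1, steps + 1)]
                  for lo, hi in zip(classes, classes[1:])]
    return max(product(*candidates),
               key=lambda ts: sum_of_classwise_accuracies(
                   original, estimated, ts, classes))


if __name__ == "__main__":
    o = [1, 1, 2, 2, 3, 3]
    e = [0.9, 1.2, 1.9, 2.2, 2.8, 3.3]
    best = dynamic_range_selection(o, e, [1, 2, 3], steps=3)
    print(best, sum_of_classwise_accuracies(o, e, best, [1, 2, 3]))
```

The thresholds returned for the training data would then be kept fixed when classifying the test samples.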

Please note that this sum of class-wise classification accuracies is not equal 
to the total ratio of correctly classified samples which is used later on in 
Sections 11.2.5.5 and 11.2.5.8; the total classification accuracy for a set of 
thresholds $acc(T_m)$ (assuming original and estimated values $o$ and $e$) is defined 
as 

$$correct(T_m) = |\{a \mid (\forall (x \in a) : o_x = ec_x(T_m))\}| \tag{11.52}$$
$$acc(T_m) = \frac{correct(T_m)}{N} \tag{11.53}$$


11.2.5.5 First Results, Identification of Optimal Operators and 
Parameter Settings 


As first reported in detail in [WAW07], during our thorough test series we 
have identified the following GP-relevant parameter settings as the best ones 
for solving classification problem instances: 


• GP-algorithm: Enhanced GP using strict offspring selection. 
• Mutation rate: 10% – 15%. 
• Population size: 500 – 2,000. 

⁶ Please note here that more than one combination of thresholds can be optimal, 
simply because there could be more than one optimal threshold for any given 
pair of class values. This is why we give an inequality in (11.51). 




• Selection operators: Whereas standard GA implementations use only 
one selection operator, the SASEGASA requires two, namely the so-called 
female selection operator as well as the male selection operator. 
Similar to our experience gained during the tests on the identification 
of mechatronical systems, it seems to be the best to choose the roulette-wheel 
selection in combination with the random selection operator. The 
reason for this is that apparently merging the genetic information of 
rather good individuals (models, formulas) with randomly chosen ones 
is the best strategy when using the SASEGASA for solving identification 
problems. 


• Success ratio and selection pressure: As for instance described 
in [AW04b], there are some additional parameters of the SASEGASA 
regarding the selection of those individuals that are accepted to be a 
part of the next generation’s population. These are the success ratio 
and the maximal selection pressure that steer the algorithm’s behavior 
regarding offspring selection. For model structure identification tasks 
in general and especially in case of dealing with classification problems, 
the following parameter settings seem to be the best ones: 

  – Success ratio = 1.0, and 

  – Maximum selection pressure = 100 – 500 (this value has to be 
    defined before starting an identification process depending on other 
    settings of the genetic algorithm used and the problem instance 
    which is to be solved). 


As has already been explained in further detail in previous chapters, 
these settings have the effect that in each generation only offspring sur- 
vive that are really better than their parent individuals (since the success 
ratio is set to 1.0, only better children are inserted into the next gen- 
eration’s population). This is why the selection pressure becomes very 
high as the algorithm is executed, and therefore the maximum selection 
pressure has to be set to a rather high value (as, e.g., 100 or 500) to 
avoid premature termination. 


• Crossover operators: We have implemented and tested three different 
single-point crossover procedures for GP-based model structure 
identification: One that exchanges rather big subtrees, one that is designed 
to exchange rather small structural parts (e.g., only one or two 
nodes), and one that replaces randomly chosen parts of the respective 
structure trees. Moreover, for each crossover operator we have also implemented 
an extended version that additionally randomly mutates all 
terminal nodes (i.e., manipulates the parameters of the represented formula). 
The following 6 structure identification crossover operators are 
available: StandardSPHigh, StandardSPMedium, StandardSPLow, ExtendedSPHigh, 
ExtendedSPMedium, and ExtendedSPLow. 
Since arbitrarily many crossover operators can be selected when applying 
the SASEGASA⁷, the task was not to find out which operator can 
be used to produce the best results but rather which subset of operators 
is to be chosen. According to what we experienced, the following set 
of crossover operators should be applied: All three standard operators 
(StandardSPHigh, StandardSPMedium, and StandardSPLow) plus one 
of the extended ones, for instance ExtendedSPLow. 


• Mutation operators: The basic mutation operator for GP structure 
identification we have implemented and tested, GAStandard, works as 
already described in Chapter 9: A function symbol could become another 
function symbol or be deleted; the value of a constant node or 
the index of a variable could be modified. Furthermore, we have also 
implemented an extended version (GAExtended) that additionally randomly 
mutates all terminal nodes (in analogy to the extended crossover 
operators). 
As the latest test series have shown, the choice of the crossover operators 
influences the decision which mutation operator to apply to the 
SASEGASA: If one of the extended crossover operators is selected, it 
seems to be best to choose the standard mutation operator. But if only 
standard crossover methods are selected, picking the extended mutation 
method yields the best results. 
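
How these settings interact can be illustrated with a simplified sketch of one generation with strict offspring selection. It is not the SASEGASA implementation: it assumes that lower fitness values are better, that a child is accepted only if it is better than both parents (success ratio 1.0), and that the selection pressure is the ratio of generated children to the population size. The function name and all parameters are hypothetical.

```python
import random


def strict_offspring_selection(population, fitness, crossover, mutate,
                               max_selection_pressure=100.0,
                               mutation_rate=0.1, rng=random):
    """Create one new generation; returns (children, pressure, completed)."""
    size = len(population)
    scores = [fitness(ind) for ind in population]
    worst = max(scores)
    # Roulette wheel on inverted scores: better individuals get larger slices.
    wheel = [worst - s + 1e-9 for s in scores]
    accepted, generated = [], 0
    while len(accepted) < size:
        if generated >= max_selection_pressure * size:
            return accepted, generated / size, False   # premature termination
        i = rng.choices(range(size), weights=wheel)[0]  # roulette-wheel parent
        j = rng.randrange(size)                         # randomly chosen parent
        child = crossover(population[i], population[j])
        if rng.random() < mutation_rate:
            child = mutate(child)
        generated += 1
        if fitness(child) < min(scores[i], scores[j]):  # better than both parents
            accepted.append(child)
    return accepted, generated / size, True


if __name__ == "__main__":
    pop = [-4.0, -3.0, -2.0, 2.0, 3.0, 4.0]
    children, pressure, completed = strict_offspring_selection(
        pop, fitness=abs, crossover=lambda a, b: (a + b) / 2.0,
        mutate=lambda x: 0.9 * x, rng=random.Random(1))
    print(children, pressure, completed)
```

The closer the population gets to an optimum, the more children have to be generated per accepted child, which is why the maximum selection pressure has to be set rather high.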


Selected experimental results of the standard GP implementation and the 
SASEGASA algorithm for the Thyroid data set using various parameter settings 
are presented in Table 11.10. For each parameter settings version, 
10-fold cross-validation test runs were executed; the resulting average results 
are listed. In all cases, the population size was 1000; furthermore, the following 
parameter settings were used: 


(1) crossover: ExtendedSPMedium; mutation: GAStandard; selection: 
roulette. 


(2) crossover: StandardSPMedium; mutation: GAExtended; selection: 
roulette. 


(3) crossover: all 6 available operators; mutation: GAExtended; selection: 
random and roulette (maximum selection pressure: 500). 


(4) crossover: all 6 available operators; mutation: GAStandard; selection: 
random and roulette (maximum selection pressure: 500). 


(5) crossover: all 3 standard operators plus ExtendedSPLow; mutation: GAStandard; 
selection: roulette and roulette (maximum selection pressure: 500). 


(6) crossover: all 3 standard operators plus ExtendedSPLow; mutation: GAStandard; 
selection: random and roulette (maximum selection pressure: 500). 


⁷ Using more than one crossover operator within the SASEGASA does not mean using a 
combination of several operators for creating one new solution. Rather, every time a new 
child is to be produced using two parent individuals, one of the given crossover operators 
is chosen randomly; the chance of being applied is equal for each operator. 


Table 11.10: Experimental results for the Thyroid data set. 

Using standard GP implementation 
  Parameter settings    Correct classifications 

  Parameter settings    Correct classifications (Prognosis) 
  (3)                   96.34% 
  (4)                   98.07% 
  (5)                   97.25% 
  (6)                   98.53% 


As an example, the model produced for cross validation partition 3 using 
the parameter settings combination (6) is shown in Figure 11.17. 


These insights have been used also in the more extensive test series docu- 
mented later on in this chapter. 


11.2.5.6 Graphical Classifier Analysis 


Graphical analysis can often help in analyzing results achieved for any kind of 
problem; this is of course also the case in machine learning and in data-based 
classification. 

The most common and also simplest way to illustrate classification 
results is to plot the target values and the estimated values into one chart; 
Figure 11.14 shows a graphical representation of the best result obtained for 
the Thyroid data set, cross-validation set 9. 

In Figure 11.15 we show 4 ROC chart examples that were generated for the 
classes ‘0’ and ‘2’ of the Thyroid data set, 10-fold cross validation set number 
9: 


(a) ROC curve for an unsuitable classifier for class ‘2’, evaluated on training 
data; 


(b) ROC curve for the best identified classifier for class ‘0’, evaluated on 
training data; 


(c) ROC curve for the best identified classifier for class ‘0’, evaluated on 
test data; 


(d) ROC curve for the best identified classifier for class ‘2’, evaluated on 
test data. 


Table 11.11: Summary of the best GP parameter settings for solving classification 
problems. 

  Parameter                      Optimal Value 
  GP algorithm                   SASEGASA (single population, i.e., GP with offspring selection) 
  Mutation rate                  10% – 15% 
  Population size                1,000 
  Maximum selection pressure     100 – 1,000 
  Parent selection operators     Random, roulette 
  Crossover operators            StandardSPLow, StandardSPMedium, StandardSPHigh, ExtendedSPLow 
  Mutation operator              GAStandard 
  Ratio of weighting the evaluation contributions 
  (SumOfSquaredErrors : separability : class ranges)    4 : 1 : 1 


Finally, in Figure 11.16 we show 4 MROC chart examples that were generated 
for the intermediate class ‘1’ of the Thyroid data set, again on the basis of 
10-fold CV-set number 9: 


(a) MROC curve for an unsuitable classifier for class ‘1’, evaluated on train- 
ing data; 


(b) MROC curve for an unsuitable classifier for class ‘1’, evaluated on test 
data; 


(c) MROC curve for the best identified classifier for class ‘1’, evaluated on 
training data; 


(d) MROC curve for the best identified classifier for class ‘1’, evaluated on 
test data. 


On the webpage of this book⁸ interested readers can find a collection of 10 
example models (exactly one for each partition of the 10-fold cross-validation) 
for the Thyroid data set, produced by GP; optimal thresholds are given as 
well as resulting confusion matrices for each data partition. 


⁸ http://gagp2009.heuristiclab.com/. 


FIGURE 11.14: Graphical representation of the best result we obtained for 
the Thyroid data set, CV-partition 9: Comparison of original and estimated 
class values. 


11.2.5.7 Classification Methods Applied in Detailed Test Series 


For comparing GP-based classification with other machine learning meth- 
ods, the following techniques for training classifiers were examined: Genetic 
programming (enhanced approach using extended parents and offspring selec- 
tion), linear modeling, neural networks, the k-nearest-neighbor method, and 
support vector machines. 


GP-Based Training of Classifiers 

We have used the following parameter settings for our GP test series: 
• Single population approach; population size: 500 to 1000

• Mutation rate: 10%

• Maximum formula tree height: 8

• Parent selection: Gender specific (random and roulette)

• Offspring selection: Strict offspring selection (success ratio as well as
comparison factor set to 1.0)

• 1-elitism

• Termination criteria:

  - Maximum number of generations: 1000; not reached, all executions
    were terminated via the maximum selection pressure criterion.

  - Maximum selection pressure: 100

• Function set: All functions as described in Table 11.9.

• Fitness functions:

  - In order to keep the computational effort low, the mean squared
    errors function with early abortion was used as fitness function for
    the GP training process.

  - The eventual selection of models is done by choosing those models
    that perform best on validation data (or, if no validation samples
    are specified, then the models’ performance on training data is
    considered). For this selection we have used the classification specific
    evaluation function described in Section 11.2: The mean squared
    error is considered as well as class ranges, threshold qualities, and
    AUC values; all other possible contributions have been neglected
    in the test series reported and discussed here. Thus, $c_1 = 4.0$,
    $c_k = 1.0$ for $k \in \{6, 7, 8\}$, and $c_k = 0.0$ for $k \in \{2, 3, 4, 5\}$.


FIGURE 11.15: ROC curves and their area under the curve (AUC) values for
classification models generated for Thyroid data, CV-set 9.
[Four ROC panels (a) to (d), each plotting True Classifications against False
Classifications; AUC values shown in the panels: 0.7490008, 0.9930586,
0.9435751, 0.9983904.]


FIGURE 11.16: MROC charts and their maximum and average area under
the curve (AUC) values for classification models generated for Thyroid data,
CV-set 9.
[Four MROC panels, each plotting True Classifications against False
Classifications:
(a) [class 1, training]: avg. AUC 0.3076477, max. AUC 0.3647628;
(b) [class 1, test]: avg. AUC 0.3607539, max. AUC 0.4361191;
(c) [class 1, training]: avg. AUC 0.9721313, max. AUC 0.9981604;
(d) [class 1, test]: avg. AUC 0.9740631, max. AUC 0.9976785.]
FIGURE 11.17: Graphical representation of a classification model (formula),
produced for 10-fold cross validation partition 3 of the Thyroid data set. 


In addition to splitting the given data into training and test data, extended 
GP-based training is implemented in such a way that a part of the given 
training data is not used for training models and serves as validation set; in 
the end, when it comes to returning classifiers, the algorithm returns those 
models that perform best on validation data. This approach has been chosen 
because it is assumed to help to cope with overfitting; it is also applied in other 
GP-based machine learning algorithms as for example described in [BL04]. In 
fact, this was also done in our standard GP tests for the Melanoma data set. 


Linear Modeling 


Given a data collection including m input features storing the information
about N samples, a linear model is defined by the vector of coefficients
$\theta_{1 \ldots m}$. For calculating the vector of modeled values $e$ using the
given input values matrix $u_{1 \ldots m}$, these input values are multiplied
with the corresponding coefficients and added:


$$e = u_{1 \ldots m} \cdot \theta \qquad (11.54)$$


The vector of coefficients can be computed by simply applying matrix divi- 
sion. For conducting the test series documented here we have used the matrix 
division function provided by MATLAB®: 


theta = InputValues \ TargetValues; 


If a constant additive factor is to be included into the model (i.e., the coeffi- 
cients vector), this command has to be extended: 
r = size(InputValues,1);
theta = [InputValues ones(r,1)] \ TargetValues; 


Theoretical background of this approach can be found in [Lju99]. 
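
The matrix division used above returns the least-squares solution of the
overdetermined system. As a minimal sketch of the same idea in plain Python,
the coefficients (optionally including the constant term) could be computed
via the normal equations. This is an assumption made for brevity and not
necessarily the solver MATLAB applies internally; the function names and the
toy data in the usage note are likewise illustrative and not part of the test
series.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so that row operations act on a and b together.
    aug = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        s = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - s) / aug[row][row]
    return x


def with_constant(inputs, add_constant):
    """Append a column of ones if a constant additive factor is wanted."""
    return [list(r) + [1.0] if add_constant else list(r) for r in inputs]


def fit_linear_model(inputs, targets, add_constant=True):
    """Least-squares coefficients theta for targets ~ inputs * theta."""
    rows = with_constant(inputs, add_constant)
    m = len(rows[0])
    # Normal equations: (U^T U) theta = U^T y
    utu = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    uty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(m)]
    return solve_linear_system(utu, uty)


def predict(inputs, theta, add_constant=True):
    """Modeled values e: input values times coefficients, summed up."""
    rows = with_constant(inputs, add_constant)
    return [sum(u * t for u, t in zip(r, theta)) for r in rows]
```

For example, fitting the toy data generated by y = 2*x1 - x2 + 3 recovers the
coefficients 2, -1, and 3 (the last one being the constant factor) up to
rounding errors.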


Neural Networks 


For training artificial neural network (ANN) models, three-layer feed- 
forward neural networks with one output neuron were created using the back- 
propagation as well as the Levenberg-Marquardt training method. Theoret- 
ical background and details can be found in [Nel01] (Chapter 11, “Neural 
Networks”), [Mar63], [Lev44], or [GMW82]. 


The following two approaches have been applied for training neural net- 
works: 


• On the one hand we have trained networks with 5 neurons in the hidden
layer (referred to as “NN1” in the test series documentation in Section
11.2.5.8) as well as networks with 10 hidden neurons (referred to as
“NN2” in the test series documentation); the number of iterations of
the training process was set to 100 (in the first variant, “NN1”) and
300 (in the second variant, “NN2”). In the context of analyzing the
benchmark problems used here, higher numbers of nodes or iterations
are likely to lead to overfitting (i.e., a better fit on the training data,
but worse test results).
The ANN training framework used to collect the results reported in this
book is the NNSYSID20 package, a neural network toolbox implementing
the Levenberg-Marquardt training method for MATLAB®; it has
been implemented by Magnus Nørgaard at the Technical University of
Denmark [Nør00].


• On the other hand, the multilayer perceptron training algorithm available
in WEKA [WF05] has also been used for training classifiers. In
this case the number of hidden nodes was set to (a + c)/2, where a is
the number of attributes (features) and c the number of classes. The
number of iterations was not pre-defined, but 10% of the training data
were designated to be used as validation data; in order to combat the
danger of overfitting, the training algorithm was terminated as soon as
the error on validation data got worse in 20 consecutive iterations.
This training method, which applies backpropagation learning, is in the
following referred to as the “NN3” method.
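
The validation-based stopping rule of the “NN3” variant can be sketched as
follows. This is not WEKA code: train_step and validation_error are
hypothetical callables to be supplied by the caller, “got worse” is read here
as “worse than in the preceding iteration”, and rounding (a + c)/2 down is
also an assumption of this sketch.

```python
def hidden_nodes(num_attributes, num_classes):
    # (a + c) / 2 as stated in the text; rounding down is an assumption.
    return (num_attributes + num_classes) // 2


def train_with_early_stopping(train_step, validation_error, patience=20,
                              max_iterations=10_000):
    """Train until the validation error got worse `patience` times in a row.

    Returns the number of iterations executed and the best validation error
    seen so far.
    """
    previous = float("inf")
    best = float("inf")
    worse_in_a_row = 0
    for iteration in range(1, max_iterations + 1):
        train_step()
        error = validation_error()
        if error > previous:
            worse_in_a_row += 1
        else:
            worse_in_a_row = 0  # the streak is broken
        best = min(best, error)
        previous = error
        if worse_in_a_row >= patience:
            return iteration, best
    return max_iterations, best
```

With patience set to 20 this mirrors the setting reported above; a real
implementation would additionally restore the network weights that produced
the best validation error.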
kNN Classification


Unlike other data-based modeling methods based on linear models, neural
networks or GP, k-nearest-neighbor classification works without creating any
explicit models. During the training phase, the data are simply collected;
when it comes to classifying a new, unknown sample $x_{new}$, the sample-wise
distance between $x_{new}$ and all training samples $x_{train}$ is calculated and
the classification is done on the basis of those k training samples ($x_j$)
showing the smallest distances from $x_{new}$.

The distance between two samples is calculated as follows: First, all features
are normalized by subtracting the respective mean values and dividing the
remaining values by the respective variables’ standard deviation. Given a
data matrix $x$ including m features storing the information about N samples,
the normalized values $x_{norm}$ are calculated as


$$\forall (i \in [1, m],\ j \in [1, N]) : \quad x_{norm}(i, j) = \frac{x(i, j) - \frac{1}{N} \sum_{k=1}^{N} x(i, k)}{\sigma(x(i, 1 \ldots N))} \qquad (11.55)$$


where the standard deviation $\sigma$ of a given variable $x$ storing N values is
calculated as


(11.56) 


with $\bar{x}$ denoting the mean value of $x$.
Then, on the basis of the normalized data, the distance between two samples
a and b, d(a, b), is calculated as the mean squared variable-wise distance:


$$d(a, b) = \frac{1}{n} \sum_{i=1}^{n} \left( a_{norm}(i) - b_{norm}(i) \right)^2 \qquad (11.57)$$


where n again is the number of features stored for each sample.

In the context of classification, the numbers of instances (of the k nearest
neighbors) are counted for each given class and the algorithm automatically
predicts the class that is represented by the highest number of instances. In
the test series documented in this book we have applied weighting to kNN
classification: The distance between $x_{new}$ and any sample $x_j$ is relevant for
the classification statement, and the weight of “nearer” samples is higher than
that of samples that are “further” away from $x_{new}$.

There is a lot of literature on kNN classification; very
good explanations and compact overviews of kNN classification (including
several possible variants and applications) are for example given in [DHS00]
and [RN03].
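
The procedure described in this subsection (normalization, mean squared
variable-wise distance, weighted vote among the k nearest neighbors) can be
sketched in a few lines of Python. Two details are assumptions of this sketch
and are not taken from the test series: the sample standard deviation (N - 1
in the denominator) is used for normalization, and neighbors are weighted by
the inverse of their distance.

```python
import statistics
from collections import defaultdict


def fit_normalizer(samples):
    """Per-feature mean and standard deviation of the training samples."""
    features = list(zip(*samples))
    means = [statistics.mean(f) for f in features]
    # Assumption: sample standard deviation (N - 1 in the denominator).
    stds = [statistics.stdev(f) for f in features]
    return means, stds


def normalize(sample, means, stds):
    return [(v - m) / s for v, m, s in zip(sample, means, stds)]


def distance(a_norm, b_norm):
    """Mean squared variable-wise distance of two normalized samples."""
    return sum((a - b) ** 2 for a, b in zip(a_norm, b_norm)) / len(a_norm)


def knn_classify(train_samples, train_labels, new_sample, k=3, eps=1e-12):
    means, stds = fit_normalizer(train_samples)
    x_new = normalize(new_sample, means, stds)
    scored = sorted(
        (distance(normalize(x, means, stds), x_new), label)
        for x, label in zip(train_samples, train_labels)
    )
    votes = defaultdict(float)
    for d, label in scored[:k]:
        # Assumed weighting scheme: nearer samples get a higher weight.
        votes[label] += 1.0 / (d + eps)
    return max(votes, key=votes.get)
```

Note that no model is built in advance: all training samples are kept and
every classification requires a pass over the whole training set.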
Support Vector Machines


Support vector machines (SVMs) are a widely used approach in machine
learning based on statistical learning theory [Vap98]; an example of the
application of SVMs in the medical domain has been reported in [MIB+00].

The most important aspect of SVMs is that it is possible to give bounds
on the generalization error of the models produced, and to select the
respectively best model from a set of models following the principle of structural
risk minimization [Vap98]. SVMs are designed to calculate hyperplanes that
separate the data from each other and maximize the margin between sets of
data points. While the basic training algorithm is only able to construct linear
separators, so-called kernel functions can be used to calculate scalar products
in higher-dimensional spaces; if the kernel functions used are nonlinear, then
the separating boundaries will be nonlinear, too.

In this work we have used the SVM training algorithm described in [Pla99]
and [KSBM01], in the implementation which is
available for the WEKA machine learning framework [WF05]. Polynomial
kernels have been used as well as Gaussian radial basis function kernels with
the γ parameter (defining the inverse variance) set to 0.01 and the complexity
parameter c set to 10,000.
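
To illustrate what such kernel functions compute, the following sketch gives
common textbook forms of the two kernel types in Python. It is not the WEKA
implementation; the degree and offset of the polynomial kernel are
illustrative assumptions, as the text does not state them, while the default
γ = 0.01 is the value reported above.

```python
import math


def polynomial_kernel(x, y, degree=2, offset=1.0):
    # degree and offset are assumptions made for illustration only.
    return (sum(a * b for a, b in zip(x, y)) + offset) ** degree


def rbf_kernel(x, y, gamma=0.01):
    """Gaussian radial basis function kernel exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

Both functions return the scalar product of the two samples in an implicitly
defined higher-dimensional space, which is all the training algorithm needs
in order to construct nonlinear separating boundaries.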


11.2.5.8 Detailed Test Series Results 


The results summarized in this section have been partially published in 
[WAW06b], [WAW06e], and [WAW07]. 

Since the Wisconsin and the Thyroid data sets are publicly available, the
results produced by GP are compared to those that have been published
previously for various machine learning methods; the Melanoma data set is not
openly available, therefore we have used all machine learning approaches
mentioned for training classifiers for this data set.

All three data sets were investigated via 10-fold cross-validation (CV). For 
each data collection, each of the resulting 10 pairs of training and test data 
partitions has been used in 5 independent GP test runs; for the Melanoma 
data set, all machine learning algorithms mentioned previously have also been 
applied to all pairs of training and test data, the stochastic algorithms again 
applied 5 times independently. 
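
The partitioning scheme can be sketched as follows. This is a minimal
illustration and not the procedure used to create the partitions of the test
series; in particular, shuffling with a fixed seed and the absence of
stratification are assumptions of this sketch.

```python
import random


def k_fold_partitions(num_samples, k=10, seed=0):
    """Yield (training_indices, test_indices) pairs for k-fold CV."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    # Every k-th shuffled index goes into the same fold.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train, test
```

Each sample is used for testing exactly once; with 5 independent runs per
pair of training and test data, a stochastic algorithm is thus executed 50
times per data set.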


Results for the Wisconsin Data Set 


Table 11.12 summarizes the results for the 10-fold cross validation produced 
by GP with offspring selection; these figures boil down to the fact that ex- 
tended GP has in this case been able to produce classifiers that on average 
correctly classify 97.91% of training samples and 97.53% of test samples. 
Table 11.12: Summary of training and test results for the Wisconsin data set:
Correct classification rates (average values and standard deviation values) for 
10-fold CV partitions, produced by GP with offspring selection. 


Partition Training 
Avg. Std. Dev. 


97.697 97.067 
97.69 97.657 
98.405 97.917 


Std. Dev. 


98.37% 98.24% 
97.52% 97.06% 
97.95% 97.94% 
0 0 
o o 


In order to compare the quality of these results to those reported in the
literature, Table 11.13 summarizes test accuracies that have been obtained
using 10-fold cross validation. For each method listed we give the references
to the respective articles in which these results have been reported⁹. The
results summarized in Table 11.12 have to be considered surprisingly good
as they outperform all other algorithms reported in the literature listed here.
In [LC05], for example, recent results for several classification benchmark
problems are documented; the Wisconsin data set was analyzed there using
standard GP as well as three other GP-based classification variants
(POPE-GP, DecMO-GP, and DecMOP-GP), and the respective results are also listed
in Table 11.13.

Of course, for the sake of honesty we have to admit that the effort GP needs
to produce these classifiers (in terms of runtime and memory) is higher than
that of most other machine learning algorithms; in our GP tests using the
Wisconsin data set and populations with 500 individuals the average number of
generations executed was 51.6 and the average number of solutions evaluated
was approximately 1,296,742.


Results for the Melanoma Data Set 


For the Melanoma data set no results are available in the literature; there- 
fore we have tested all machine learning algorithms mentioned previously for 
getting an objective evaluation of our GP methods. 


9 An even more detailed listing of test results for this data set can be found in [JHC04]. 




Table 11.13: Comparison of machine learning methods: Average test accuracy
of classifiers for the Wisconsin data set.


Algorithm | Test Accuracy
GP with OS | 97.53%
Probit [WHMS03] | 97.20%
RLP [BU95] | 97.07%
SVM [WHMS03] | 96.70%
C4.5 (decision tree) [HSC96] | 96.0%
ANN [TG97] | 95.61%
DecMOP-GP [LC05] | 95.60%
DecMO-GP [LC05] | 95.19%
POPE-GP [LC05] | 95.08%
StandardGP [LC05] | 93.82%


First, in Table 11.14 we summarize original vs. estimated classifications 
obtained by applying the classifiers produced by GP with offspring selection; 
in total, 97.17% of the training and 95.42% of the test samples are classified 
correctly (with standard deviations 0.87 and 2.13, respectively). These GP 
tests using the Melanoma data set were done with populations containing 
1,000 individuals; the average number of generations executed was 54.4 and 
the average number of solutions evaluated ~2,372,629. 


Table 11.14: Confusion matrices for average classification results produced by 
GP with OS for the Melanoma data set. 
Training Original Classification 
[0] (Benign) [1] (Malign) 
Estimated 
Classification 


Estimated 115.18 (87.92% 
Classification 3.33 (2.54% 


Test results obtained using other machine learning algorithms are collected
in Table 11.15. Support vector machine based training was done with radial
as well as with polynomial kernel functions, using γ values of 0.001 and
0.01. In standard GP (SGP) tests we used tournament parent
selection (k = 3), 8% mutation, single point crossover and the same structural
limitations as in GP with OS; in order to get a fair comparison, the population
size was set to 1,000 and the number of generations to 2,500, yielding 2,500,000
evaluations per test run.

As we can see in Table 11.15, our GP implementation performs approximately
as well as the support vector machines and neural nets applying those
settings that are optimal in this test case: GP with OS was able to classify
95.42% of the test cases correctly, SVMs correctly classified 94.89% to 95.47%
and neural nets (with validation set based stopping) 95.27% of the test cases
evaluated. Standard GP as well as kNN, linear regression, and standard ANNs
perform worse.

Even though it is nice to see that the average accuracy recorded for models 
produced by GP with OS is quite fine, the relatively high standard deviation 
of this method’s performance (2.13, compared to 0.41 recorded for optimal 
SVMs) has to be seen as a negative aspect of these results. 


Table 11.15: Comparison of machine learning methods: Average test accuracy
of classifiers for the Melanoma data set.


Algorithm | Test Accuracy (Avg.) | Test Accuracy (Std. Dev.)
SVM (radial, γ = 0.01) | 95.47% | 0.41
GP with OS | 95.42% | 2.13
SVM (polynomial, γ = 0.01) | 95.40% | 0.56
SVM (radial, γ = 0.001) | 95.27% | 0.74
NN3 | 95.27% | 1.91
SVM (polynomial, γ = 0.001) | 94.89% | 0.83
NN1 | 94.35% | 2.39
kNN (k = 3) | 93.59% | 1.03
SGP | 93.52% | 3.72
NN2 | 92.90% | 2.59
kNN (k = 5) | 92.85% | 0.94
Lin | 92.45% | 2.90


Results for the Thyroid Data Set 


Finally, the results achieved for the Thyroid data set are to be reported here. 
Table 11.16 summarizes the results for the 10-fold cross validation produced 
by GP with offspring selection. For each class we characterize the classifica- 
tion accuracy on training and test data, giving average as well as standard 
deviation values for each partition. These figures boil down to the fact that 
extended GP has in this case been able to produce classifiers that on average 
correctly classify 99.10% of training samples and 98.76% of test samples, the 
total standard deviation values being 0.73 and 0.92, respectively. 
Table 11.16: Summary of training and test results for the Thyroid data set:
Correct classification rates (average values and standard deviation values) for 
10-fold CV partitions, produced by GP with offspring selection. 

Partition 


Training Test 
tase 1| Clase #1 Class 3 | Class 1 | Class 2 | Class 3 
0 avg. | 94.67% | 97.64% | 99.63% | 90.00% | 95.68% | 99.19% 
siden | 170] 26| os| Tis a| o 
1 avg. | 94.93% | 98.67% | 99.01% | 88.75% | 96.76% i 
3.58 4.23 0.46 5.23 5.86 


std. dev. 
. | 96.67% | 98.49% | 99.49% YP 91.25% | 96.22% 
Piss [200 [055 |e || 
90.00% | 95.68% 


3 avg. | 96.00% | 98.19% | 99.15% 
4 avg. | 95.33% | 97.04% | 99.19% | 88.75% | 96.22% 
2.45 5.38 0.35 11.18 3.63 
95.00% | 94.59% 


std. dev. 
. | 95.07% | 96.62% | 99.22% 
Pee | s| oao 
87.50% | 94.59% 
87.50% | 92.97% 


6 aug. | 93.47% | 97.76% | 99.16% 
2.18 7.64 0.49 
7.65 4.52 


std. dev. 
98.80% | 98.97% | 99.16% 
96.25% | 94.05% 
5.23 3.52 


7 aug. 
std. dev. 2.18 5.92 0.49 
94.40% | 98.01% | 99.23% 
91.25% | 92.43% 
90.63% | 94.92% 


9 aug. 
std. dev. 
Avg. aug. 
std. dev. 


97.73% | 96.62% | 99.31% 
aeo oss| 032 
95.71% | 97.80% | 99.26% 
pos] in| oi 


In order to compare the quality of these results to those reported in the
literature, Table 11.17 summarizes a selection of test accuracies that have
been obtained using 10-fold cross validation; again, for each method listed
we give the references to the respective articles in which these results have
been reported. The results summarized in Table 11.16 have to be
considered quite fine, but not perfect, as they are outperformed by results
reported in [WK90] and [DAG01].

GP has also been repeatedly applied for solving the Thyroid problem; some
of the results published are the following ones:

In [LH06] (Table 8), results produced by a Pareto-coevolutionary GP classifier
system for the Thyroid problem are reported, and here in Table 11.17 these
results are stated as the “PGPC” results; in fact, these results are not
mean accuracy values but rather median values, which is why they
are not totally comparable to the other results stated here. Loveard and Ciesielski
[LC01] reported that classifiers for the Thyroid problem could be produced
using GP with test accuracies ranging from 94.9% to 98.2% (depending on
the range selection strategy used).

According to Banzhaf and Lasarczyk [BL04], GP-evolved programs consisting
of register machine instructions turned out to misclassify on average 2.29%
of the given test samples, while optimal classifiers are able
to correctly classify 98.64% of the test data.

Furthermore, Gathercole and Ross [GR94] report classification errors between
1.6% and 0.73% as best results using tree-based GP, and note that a
classification error of 1.52% for neural networks is reported in [SJW92]. In fact,
Gathercole and Ross reformulated the Thyroid problem to classifying cases
as “class 3” or “not class 3”; as is stated in [GR94], it turned out to be
relatively straightforward for their GP implementation (DSS-GP) to produce
function tree expressions which could distinguish between classes “1” and “2”
completely correctly on both the training and test sets. “To be fair, in
splitting up the problem into two phases (class 3 or not, then class 1 or 2) the
GP has been presented with an easier problem [...]. This could be taken in
different ways: Splitting up the problem is mildly cheating, or demonstrating
the flexibility of the GP approach.” (Taken from [GR94].)


Table 11.17: Comparison of machine learning methods: Average test accuracy
of classifiers for the Thyroid data set.


Algorithm | Accuracy (Training) | Accuracy (Test)
CART [WK90] | 99.80% | 99.36%
PVM [WK90] | 99.80% | 99.33%
Logical Rules [DAG01] | - | 99.30%
GP [GR94] | - | 98.4% to 99.27%
GP with OS | 99.10% | 98.76%
GP [BL04] | - | 97.71% to 98.64%
GP [LC01] | - | 94.9% to 98.2%
BP + local adapt. rates [SJW93] | 99.6% | 98.5%
ANN [SJW92] | - | 98.48%
BP + genetic opt. [SJW93] | 99.4% | 98.4%
Quickprop [SJW93] | 99.6% | 98.3%
RPROP [SJW93] | 99.6% | 98.0%
PGPC [LH06] | - | 97.44%

GP with strict offspring selection was here applied with populations of 1000
individuals; on average, the number of generations executed in our GP tests
for the Thyroid test studies was 73.9, and on average 2,463,635.1 models were
evaluated in each GP test run.


11.2.5.9 Conclusion 


We have here described an enhanced genetic programming method that was 
successfully used for investigating machine learning problems in the context 
of medical classification. The approach works with hybrid formula structures 
combining logical expressions (as used for example in decision trees) and classi- 
cal mathematical functions; the enhanced selection scheme originally success- 
fully applied for solving combinatorial optimization problems using genetic 
algorithms was also applied yielding high quality results. 

We have intensively investigated GP in the context of learning classifiers for 
three medical data collections, namely the Wisconsin and the Thyroid data 
sets taken from the UCI machine learning repository and the Melanoma data 
set, a collection that represents medical measurements which were recorded 
while investigating patients potentially suffering from skin cancer. The results 
presented in this section are indeed satisfying and make the authors believe 
that an application in a real-world framework in the context of medical data 
analysis using the techniques presented here is recommended. As documented 
in the test results summary, our GP-based classification approach is able to 
produce results that are — in terms of classification accuracy — at least com- 
parable to or even better than the classifiers produced by classical machine 
learning algorithms frequently used for solving classification problems, namely 
linear regression, neural networks, neighborhood-based classification, or sup- 
port vector machines as well as other GP implementations that have been 
used on the data sets investigated in our test studies. 
11.3 Genetic Propagation


11.3.1 Test Setup


When speaking of analysis of genetic propagation as described in Section 
6.1, we analyze how well which parts of the population succeed in propagating 
their genetic material to the next generation, i.e., to produce offspring that 
will be included in the next generation’s population. In this section we shall 
report on tests in this area; major parts have been published in our article on 
offspring selection and its effects on genetic propagation in GP-based system 
identification [WAW08] as well as in [Win08]. 

We have here used the NOx data set II already presented and described in
Section 11.1.2.3. Originally, this data set includes 10 variables, each storing
approximately 36,000 samples; the first 10,000 samples are neglected in the
tests reported on here, approximately 18,000 samples are used as training data,
and 4,000 samples as validation data (which is in this case equivalent to test
data). The last approximately 4,000 samples are again neglected.

In principle, we are using conventional GP (with tournament and propor- 
tional selection) as well as extended GP (with gender specific selection as well 
as offspring selection). The details of the test strategies used are given in 
Table 11.18. 


Table 11.18: GP test strategies.


Strategy | Properties
(I) Conventional GP | Population size: 1000; tournament parent selection (k = 3); nr. of rounds: 1000
(II) Conventional GP | Population size: 1000; proportional parent selection; nr. of rounds: 1000
(III) Extended GP | Population size: 500; gender specific parent selection (proportional, random); offspring selection (SuccessRatio = 1, MaxSelPres = 100)

In all three test strategies we used subtree exchange crossover, the time
series analysis specific evaluation function (with early abortion as described
in Section 11.1.1) for evaluating solutions, and applied 1-elitism as well as
15% mutation rate.


11.3.2 Test Results 


We have executed independent test series with 5 executions for each test
strategy; the results are summarized and analyzed here.

With respect to solution quality and effort¹⁰, the extended GP algorithm
clearly outperforms the conventional GP variants (as summarized in Table
11.19).


Table 11.19: Test results. 


Best m 

Quality [5381 501190 
(Training) 

Bost in ; 
Quality : 5912.01 


(Test) . | 13,945.23 | 21,315.23 | 16,123.34 
Generations 64.31 


Effort 1,000,000 898,332.23 


Regarding parent analysis, in all test runs we documented the propagation
count for each individual and summed these over all generations. So we get


$$pc_{total}(i) = \sum_{j \in [1; gen]} pc_j(i) \qquad (11.58)$$


for each individual index $i$, where gen is the number of generations
executed. Additionally, we form equally sized partitions of the population
indices and sum up the $pc_{total}$ values for each partition.
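
The bookkeeping described here amounts to a column-wise sum followed by a
block-wise sum. A minimal Python sketch, assuming that the propagation counts
are available as one list per generation (this data layout and the function
names are assumptions made for illustration):

```python
def total_propagation_counts(pc_per_generation):
    """pc_total(i): sum of the propagation counts at index i over all generations."""
    population_size = len(pc_per_generation[0])
    return [sum(gen[i] for gen in pc_per_generation)
            for i in range(population_size)]


def partition_sums(pc_total, partition_size):
    """Sum pc_total over equally sized, consecutive partitions of the indices."""
    return [sum(pc_total[start:start + partition_size])
            for start in range(0, len(pc_total), partition_size)]
```

For a population of 1000 individuals, a partition size of 100 yields the 10
percentile values, whereas a partition size of 10 yields a finer resolution
of 100 values.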

In Table 11.20 we give the average $pc_{total}$ values for percentiles of the
populations of test series I, II, and III; for test series I and II we collected
the $pc_{total}$ values of 100 indices for forming a partition, and for test
series III we collected 50 indices for each partition. Figures 11.18 and 11.19
show $pc_{total}$ values of exemplary test runs of the series I and II summed up
for partitions of 10 solution indices each. Figure 11.20 shows $pc_{total}$ values
of exemplary test runs of series III summed up for partitions of 5 solution
indices each.


10 The number of solutions evaluated is here interpreted as the algorithm’s total effort.
Table 11.20: Average overall genetic propagation of population partitions.


Population Percentile | Test Strategy I | Test Strategy II | Test Strategy III


27. a 10. n 


FIGURE 11.18: $pc_{total}$ values for an exemplary run of series I.


FIGURE 11.19: $pc_{total}$ values for an exemplary run of series II.


FIGURE 11.20: $pc_{total}$ values for an exemplary run of series III.


As we see from the results given in Tables 11.19 and 11.20 and Figure 11.18, 
there is a rather high selection pressure when using tournament selection; the 
results are rather good and (as expected) less fit individuals are by far not 
able to contribute to the population as well as fitter ones, leading to a quick 
and drastic reduction of genetic diversity. 


The results for test series II, as given in Tables 11.19 and 11.20 and Figure 
11.19, are significantly different: The results are a lot worse (especially on 
training data) than those of algorithm variant I, and obviously there is no 
strong selection pressure as almost all individuals (or, rather the individuals at 
the respective indices) are able to contribute almost to the same extent. Only 
the worst ones are not able to propagate their genetic material to the next 
generations as well as better ones. This is due to the fact that in the presence 
of very bad individuals roulette wheel selection selects the best individuals 
approximately as often as those that perform middlingly well. Especially in 
data-based modeling there are often individuals that score extremely badly 
(due to divisions by very small values, for example), and in comparison to 
those all other ones are approximately equally fit. 
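
This loss of selection pressure is easy to reproduce numerically. The
following sketch computes roulette wheel selection probabilities for a
minimization problem; the scaling used (weight = worst error minus own error)
is an assumption of this sketch and not necessarily the scaling applied in
the test series, and the error values in the usage note are made up for
illustration.

```python
def roulette_probabilities(errors):
    """Selection probabilities for minimization; lower error is fitter.

    Assumed scaling: weight = worst error - own error, so the worst
    individual receives weight 0.
    """
    worst = max(errors)
    weights = [worst - e for e in errors]
    total = sum(weights)
    if total == 0:
        # All individuals are equally fit: select uniformly.
        return [1.0 / len(errors)] * len(errors)
    return [w / total for w in weights]
```

For the errors 10, 12, 15, and 20 the best individual is selected with a
probability of about 0.43 and the third best with about 0.22. As soon as one
individual with an error of 1,000,000 is added, the same four individuals are
selected with almost identical probabilities of about 0.25 each, and only the
extremely bad one is never selected.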


Finally, test series III obviously produced the best results with respect to
training as well as validation data (see also Table 11.19). Moreover, the
results that are given in Table 11.20, column III, and displayed in Figure 11.20,
show that the combination of random and roulette parent selection and
offspring selection results in a very moderate distribution of the $pc_{total}$ values:
Fitter individuals contribute more than less fit ones, but even the worst ones
are still able to contribute to a significant extent. Thus, genetic diversity is
increased, which also contributes positively to the genetic programming
process.


11.3.3 Summary 


Thus, in order to sum up this section, offspring selection in GP-based system
identification significantly influences the algorithm’s ability to create high
quality results as well as the genetic propagation dynamics: Not only fitter
individuals are able to propagate their genetic make-up, but also less fit ones
are able to contribute to the next population. This is to some extent also the
case when using proportional selection, but in the presence of individuals with
very bad fitness values the selection pressure is almost lost, which leads to
solutions of rather bad quality. When using offspring selection, extremely bad
individuals are eliminated immediately; when using OS in combination with
gender specific parent selection (applying random and proportional selection
mechanisms), GP is able to produce significantly better results than when
using standard techniques. Parent diversification and thus increased genetic
diversity in GP populations is considered one of the most influential aspects
in this context.


11.3.4 Additional Tests Using Random Parent Selection 


In addition to the tests reported on in the previous parts of this section we 
have also tested conventional as well as extended GP using random parent 
selection. Thus, we have two more test cases to be analyzed. 


Table 11.21: Additional test strategies for genetic propagation tests.


Strategy | Properties
(IV) Conventional GP | Random parent selection; nr. of rounds: 500
(V) Extended GP | Random parent selection; offspring selection (SuccessRatio = 1, MaxSelPres = 100)


As we had expected, the test results obtained for standard GP with random
parent selection were very bad; obviously, no suitable models were found.
When using OS, on the contrary, the test results for random parent selection
were not bad at all: The models are (on training data) not quite as good
as those obtained using random/roulette and OS or conventional GP with
tournament parent selection, but still they perform (surprisingly) well on test
data¹¹. In Table 11.22 we summarize the respective result qualities.


¹¹ Of course, these remarks are only valid for the tests reported on here; we do not
give any general statement regarding result quality using random parent selection and OS.


Table 11.22: Test results in additional genetic propagation tests (using random
parent selection).

[Table body garbled in the source. Recoverable row labels: Best quality (training),
Best quality (test), Generations; the only legible value, 12,053.9, appears in the
"Best quality (test)" row.]

In Table 11.23 we give the average pc_total values for percentiles of the
populations of test series IV and V (collecting the pc_total values of 200 indices
for forming a partition for series IV and 50 indices for each partition for
series V). Obviously (and exactly as we had expected), random parent selection
leads to all individuals having approximately the same success in propagating
their genetic make-up. When using OS, the result is (even a little bit
surprisingly) significantly different: Better individuals have a much higher chance
to produce successful offspring than worse ones; the probability of the best
10%, for example, to produce successful children is almost twice as high as
the probability of the worst 10% to do so.


Table 11.23: Average overall genetic propagation of population partitions for
random parent selection tests.

[Table body garbled in the source. Recoverable column headers: Population
percentile, Test strategy; the numeric entries are not legible.]


Obviously, random parent selection leads to an increased number of
generations that have to be executed until a given selection pressure limit is
reached. This is graphically shown in Figure 11.21, which gives the selection
pressure progress for two exemplary test runs of the test series including OS,
i.e., III and V. In the standard case using random / roulette parent selection
and offspring selection, III, the selection pressure obviously rises faster than
when using random parent selection in combination with strict offspring
selection. Still, even though it takes longer when using random parent selection,
the characteristics are very similar, i.e., it rises steadily with some notable
fluctuations.


[Plot: Average Selection Pressure Progress; curves: Sel.Pres. (III) and
Sel.Pres. (V); x-axis: Generations]


FIGURE 11.21: Selection pressure progress in two exemplary runs of test
series III and V (extended GP with gender specific parent selection and strict
offspring selection).


11.4 Single Population Diversity Analysis


11.4.1 GP Test Strategies


Within our first series of empirical tests regarding solution similarity and
diversity we analyzed the diversity of populations of single population GP
processes. For testing the population diversity analysis method described in
Section 6.2 and illustrating graphical representations of the results of these
tests we have used the following two data sets:


• The NOx data set contains the measurements taken from a 2 liter 4
cylinder BMW diesel engine at a dynamical test bench (simulated
vehicle: BMW 320d Sedan); this data set has already been described in
Section 11.1 as NOx data set III.


• The Thyroid data set is a widely used machine learning benchmark data
set containing the results of medical measurements which were recorded
while investigating patients potentially suffering from hypothyroidism;
further details regarding this data set can be found in Chapter 11.2.


Both data collections have been split into training and validation / test data 
partitions taking the first 80% of each data set as training samples available 
to the identification algorithm; the rest of the data is considered as validation 
data. 
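
The split described here keeps the original order of the samples instead of
shuffling them. A minimal sketch of this partitioning is given below; the function
name and the list-based data representation are assumptions made for illustration.

```python
def split_first_fraction(samples, train_fraction=0.8):
    """Use the first part of the data for training, the rest for validation."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]

training, validation = split_first_fraction(list(range(10)))
```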

We have used various GP selection strategies for analyzing the NOx and
the Thyroid data sets:


• On the one hand, we have used standard GP with proportional as well
as tournament selection (tournament size k = 3).


• On the other hand, we have also intensively tested GP using offspring
selection and gender specific parent selection (proportional and random
selection).


In general, we have tested GP with populations of 1,000 solution candidates
(with a maximum tree size of 50 and a maximum tree height of 5), standard
subtree exchange crossover, structural as well as parametric node mutation,
and a total mutation rate of 15%; the mean squared error function was used
for evaluating the solutions on training as well as on validation (test) data.
Other essential parameters vary depending on the test strategies; these are
summarized in Table 11.24.


Table 11.24: GP test strategies.

Strategy        | Properties
(A) Standard GP | Tournament parent selection
                | (tournament size k = 3);
                | Number of generations: 4000
(B) Standard GP | Proportional parent selection;
                | Number of generations: 4000
(C) GP with OS  | Gender specific parent selection
                | (random & proportional);
                | Success ratio: 0.8
                | Comparison factor: 0.8
                | Maximum selection pressure: 50
                | (not reached)
                | Number of generations: 4000
(D) GP with OS  | Gender specific parent selection
                | (random & proportional);
                | Success ratio: 1.0
                | Comparison factor: 1.0
                | Maximum selection pressure: 100

11.4.2 Test Results 


In Table 11.25 we summarize the quality of the best models produced using
the GP test strategies (A) to (D); for the NOx data set the quality is given
as the mean squared error; for the Thyroid data set we give the classification
accuracy, i.e., the ratio of samples that are classified correctly. The models
are evaluated on training as well as on validation data; as each test strategy
was executed 5 times independently, we here state average and standard
deviation values.
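
The two quality measures and the aggregation over independent runs can be
sketched as follows. The function names are illustrative, and the use of the
sample standard deviation is an assumption, since the text does not state which
variant was used.

```python
from statistics import mean, stdev

def mse(targets, estimates):
    """Mean squared error between target values and model estimates."""
    return mean((t - e) ** 2 for t, e in zip(targets, estimates))

def classification_accuracy(true_classes, predicted_classes):
    """Ratio of samples that are classified correctly."""
    hits = sum(t == p for t, p in zip(true_classes, predicted_classes))
    return hits / len(true_classes)

def summarize_runs(values):
    """Average and (sample) standard deviation over independent runs."""
    return mean(values), stdev(values)

run_accuracies = [0.97, 0.98, 0.99]   # made-up values, for illustration only
avg, std = summarize_runs(run_accuracies)
```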

Obviously, the test series (A) and (D) perform best; the results produced 
using offspring selection are better than those using standard GP. The classifi- 
cation results for the Thyroid data set are not quite as good as those reported 
in [WAW06e] and Section 11.2; this is due to the fact that we here used smaller 
models and concentrated on the comparison of GP strategies with respect to 
population diversity. 


Solution quality analysis is of course important and interesting, but here
we are more interested in a comparison of population diversity during the
execution of the GP processes. We have calculated the similarity among the
GP populations during the execution of the GP test series described in Table
11.24: The multiplicative similarity approach (as defined in Equations 9.63
to 9.66) has been chosen; all coefficients c1 ... c10 were set to 0.2, only the
coefficient weighting the level difference contribution was set to 0.8.


Table 11.25: Test results: Solution qualities.

[Most numeric entries of this table are garbled in the source; the layout and the
legible values are given below.]

Results for NOx test series
GP Strategy                | (A) | (B) | (C) | (D)
Training (mse)             | not legible
Training (std(mse))        | not legible
Validation (mse)           | not legible
Validation (std(mse))      | not legible
Evaluated solutions (avg.) | not legible
Generations (avg.)         | not legible

Results for Thyroid test series
Training (cl. acc., avg.)  | 0.9794 | 0.9758 | 0.9781 (three values legible; column assignment uncertain)
(row label not legible)    | 0.0032 | 0.0017 | 0.0035
Evaluated solutions (avg.) | not legible
Generations (avg.)         | not legible

In Table 11.26 we give the average population similarity values calculated
using Equation 6.7; again, as each test series was executed several times, we
give the average and standard deviation values (written in italic letters). As
we see in the first row, the average similarity values are approximately in the
interval [0.2; 0.25] at the beginning of the GP runs, i.e., after the initialization
of the GP populations. In standard GP, as can be seen in the first column, the
average similarity reaches values above 0.7 after 400 generations and stays at
approximately this level until the end of the execution of the GP process; in
the end, the average similarity was ~0.87 in the NOx tests and ~0.81 in the
Thyroid test series. Analyzing the second and the third column we notice that
this is not the case in test series (B) and (C): In test series (B) the similarity
values do not rise nearly as high as in series (A) (especially when working on
the Thyroid data set), and also in test series (C) we have measured significantly
lower similarities than in series (A) (i.e., the population diversity was higher
during the whole GP process). Obviously, the use of offspring selection with
rather soft parameter settings (i.e., success ratio and comparison factor set to
values below 1.0) does not have the same effects on the GP process as strict
ones. By far the highest similarity values are documented for test series (D)
using maximally strict offspring selection (which has produced the best quality
models, as documented in Table 11.25): As is summarized in the far right
column, during the whole evolutionary process the mutual similarity among
the models increases steadily, while the selection pressure also increases. In
the end, when the selection pressure reaches a high level (in these cases, the
predefined limit was set to 100) and the algorithm stops, we see a very high
similarity among the solution candidates, i.e., the population has converged
and evolution is likely to have gotten stuck. This is consistent with
the impression already stated in, e.g., [WAW06a] or [WAW06e]; here we see
that this really happens.
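
The average and maximum similarity statistics discussed here can be illustrated
with the following sketch. Equations 6.5 to 6.8 are not reproduced in this
section, so the exact aggregation is an assumption: for each model we take the
mean (respectively the maximum) of its similarities with all other models and
then average over the population. The `similarity` argument stands for any
pairwise similarity function; the Jaccard index on symbol sets used in the
example is only a stand-in for the structural similarity measure of Section 9.4.2.

```python
def similarity_profile(population, similarity):
    """Return (average similarity, average maximum similarity) of a population."""
    avg_per_model, max_per_model = [], []
    for i, model in enumerate(population):
        # Similarities of this model with every other model in the population.
        sims = [similarity(model, other)
                for j, other in enumerate(population) if j != i]
        avg_per_model.append(sum(sims) / len(sims))
        max_per_model.append(max(sims))
    n = len(population)
    return sum(avg_per_model) / n, sum(max_per_model) / n

def jaccard(a, b):
    """Stand-in similarity: overlap of the symbol sets of two models."""
    return len(a & b) / len(a | b)

toy_population = [{"x", "+"}, {"x", "+"}, {"y"}]
avg_sim, max_sim = similarity_profile(toy_population, jaccard)
```

A converged population in the sense described above corresponds to both values
approaching 1.0.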


Table 11.26: Test results: Population diversity (average similarity values;
avg., std.).

[Parts of this table are garbled in the source; standard deviations are given in
parentheses where legible.]

NOx tests
Only one column of the NOx part is legible, and its column assignment is lost.
Row by row (avg., std. in parentheses): 0.197 (0.039); 0.397 (0.039);
0.603 (0.049); 0.810 (0.039); 0.985 (0.0382).

Thyroid tests
Gen.              | (A)           | (B)           | (C)           | Gen.       | (D)
0                 | 0.206         | 0.205         | 0.208         |            | 0.197
100               | 0.581         | 0.241 (0.043) | 0.444 (0.035) | 10         | 0.397
400               | 0.737         | 0.321         | 0.610         | 20         | 0.602
1000              | 0.808         | 0.341         | 0.692         | 40         | 0.810
4000 (End of run) | 0.812 (0.038) | 0.343 (0.056) | 0.701 (0.030) | End of run | 0.975 (0.019)


In Table 11.27 we summarize the maximum population diversity values
calculated using Equation 6.8; again we give the average and standard deviation
values (written in italic letters). As we see in the first (left) column, in
standard GP with tournament selection the average maximum similarity reaches
values above 0.95 rather fast, i.e., for all models in the population rather
similar solutions can be found. This is not the case when using proportional
selection. When using offspring selection the same effect as in standard GP
with tournament selection can be seen, especially in the NOx test series.

Figures 11.22 to 11.25 exemplarily show the average population diversity
by giving the distribution of similarities among all individuals. Figures
11.22 and 11.23 show the similarity distributions of an exemplary test run of
series (A) at generations 200 and 4000; obviously, most similarity calculations
returned similarity values between 0.7 and 1.0, and the distribution at
generation 200 is comparable to the distribution at the end of the test run. For
the GP runs incorporating offspring selection this is not the case, as we
exemplarily see in Figures 11.24 and 11.25: After 20 generations most similarity
values approximately follow a Gaussian distribution with mean value 0.8, and at
the end of the run all models are very similar to each other (i.e., the population
has converged, the selection pressure reaches the given limit and the algorithm
stops).


Table 11.27: Test results: Population diversity (maximum similarity values;
avg., std.).

[Parts of this table are garbled in the source: the generation labels of the
columns (A) to (C) and most standard deviations are lost; legible standard
deviations are given in parentheses.]

NOx tests
(A)           | (B)           | (C)           | Gen.       | (D)
0.919         | 0.934 (0.095) | 0.904         |            | 0.936
0.995         | 0.825         | 0.944         | 10         | 0.961
0.998 (0.006) | 0.809 (0.075) | 0.978 (0.037) | 20         | 0.971 (0.033)
0.999         | 0.811         | 0.965         | 40         | 0.995
0.999         | 0.819         | 0.969         | End of run | 0.996

Thyroid tests
[Values not legible in the source.]


[Histogram: Similarity Values Histogram (NOx, A, Generation 200)]

FIGURE 11.22: Distribution of similarity values in an exemplary run of NOx
test series (A), generation 200.

Finally, Figure 11.26 shows the average similarity values for each model
(calculated using Equation 6.5) for exemplary test runs of the Thyroid test
series (A)¹² and (D). Obviously, the average similarity in standard GP reaches
values in the range [0.7; 0.8] very early and then stays at this level during the
rest of the GP execution. When using gender specific selection and offspring
selection, in contrast, the average similarity steadily increases during the GP
process and almost reaches 1.0 at the end of the run, when the maximum
selection pressure is reached.


11.4.3 Conclusion 


Structural similarity estimation has been used for measuring the genetic 
diversity among GP populations: Several variations of genetic programming 
using different types of selection schemata have been tested using fine-grained 
similarity estimation, and two machine learning data sets have been used 
for these empirical tests. The test results presented show that population 
diversity differs a lot in the test runs depending on the selection schemata 
used. 


¹² In fact, for the test run of series (A) we here only show the progress over the first 2000
generations.


[Histogram: Similarity Values Histogram (NOx, A, Generation 4000)]

FIGURE 11.23: Distribution of similarity values in an exemplary run of NOx
test series (A), generation 4000.


[Histogram: Similarity Values Histogram (NOx, D, Generation 20)]

FIGURE 11.24: Distribution of similarity values in an exemplary run of NOx
test series (D), generation 20.


[Histogram: Similarity Values Histogram (NOx, D, Generation 95)]

FIGURE 11.25: Distribution of similarity values in an exemplary run of NOx
test series (D), generation 95.


[Plot: x-axis: Iterations]

FIGURE 11.26: Population diversity progress in exemplary Thyroid test runs
of series (A) and (D) (shown in the upper and lower graph, respectively).


11.5 Multi-Population Diversity Analysis 


Our second series of empirical tests regarding solution similarity and
diversity was dedicated to the diversity of populations of multi-population GP
processes; for testing the multi-population diversity analysis method described
in Section 6.2 and illustrating graphical representations of the results of these
tests we have again used the following two data sets: the NOx data set III
described in Section 11.1 as well as the Thyroid data set.

Both data collections have been split into training and validation / test
data partitions: In the case of the NOx data set the first 50% of the data set
were used as training samples; in the case of the Thyroid data set the first
80% were considered by the training algorithms.


11.5.1 GP Test Strategies 


In general, 4 different strategies for parallel genetic programming have been 
applied: 


• Parallel island GP without interaction between the populations; i.e., all
populations evolve independently.


• Parallel island GP with occasional migration after every 100th generation
in standard GP and every 5th generation in GP with offspring
selection: The worst 1% of each population p_i is replaced by copies of
the best 1% of solutions in population p_{i-1}; the best solutions of the last
population (in the case of n populations that is p_n) replace the worst ones
of the first population (p_1). The unidirectional ring migration topology
has been used.


• Parallel island GP with migration after every 50th generation in
standard GP and every 5th generation in GP with offspring selection: The
worst 5% of each population p_i is replaced by copies of the best 5% of
solutions in population p_{i-1}. Again, the unidirectional ring migration
topology has been used.


• Finally, the SASEGASA algorithm as described in Chapter 5 has been
used as well.
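
The unidirectional ring migration used in the second and third strategy can be
sketched as follows. The function name, the assumption of equally sized
populations, and the choice to determine all emigrants before any replacement
takes place are ours; fitness is assumed to be minimized (as with the mean
squared error).

```python
import copy

def migrate_ring(populations, fitness, rate):
    """Replace the worst share of each population p_i by copies of the best
    share of p_{i-1}; the last population feeds the first one."""
    ranked = [sorted(pop, key=fitness) for pop in populations]   # best first
    k = max(1, round(rate * len(ranked[0])))                     # migrants per population
    emigrants = [copy.deepcopy(pop[:k]) for pop in ranked]
    # Index i - 1 wraps around for i = 0, which closes the ring.
    return [pop[:len(pop) - k] + emigrants[i - 1]
            for i, pop in enumerate(ranked)]

toy = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
migrated = migrate_ring(toy, fitness=lambda x: x, rate=0.25)
```

With 200 solutions per population, a rate of 0.01 moves 2 solutions and a rate
of 0.05 moves 10 solutions per migration step.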


In all cases the algorithms have been initialized with 5 populations, each
containing 200 solutions (in our case representing formulas, of course).
Additionally, each of the first 3 strategies has been tested with standard GP settings
as well as offspring selection; Table 11.28 summarizes the 7 test strategies that
have been applied and whose results shall be discussed here.


11.5.2 Test Results


All test strategies summarized in Table 11.28 have been executed 5 times
using the NOx as well as the Thyroid data set. Multi-population diversity
was measured using the equations given in Section 6.2.2: For each solution 
we calculate the average as well as the maximum similarities with solutions 
of all other populations of the respective algorithms (in the following, these 
values are denoted as MPdiv values). Additionally, we have also collected 
all solutions of the algorithms’ populations into temporary total populations 
and calculate the average as well as the maximum similarities of all solutions 
compared to all other ones (hereafter denoted as SPdiv values). 
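
The two kinds of values can be illustrated with the following sketch. The
equations of Section 6.2.2 are not reproduced here, so pooling the solutions of
all other populations before averaging is an assumption, as are the function
names and the toy similarity function on numbers.

```python
def mp_div(populations, similarity):
    """Per solution: (average, maximum) similarity with all solutions of the
    other populations."""
    result = []
    for k, pop in enumerate(populations):
        others = [s for j, p in enumerate(populations) if j != k for s in p]
        for sol in pop:
            sims = [similarity(sol, o) for o in others]
            result.append((sum(sims) / len(sims), max(sims)))
    return result

def sp_div(populations, similarity):
    """Per solution: (average, maximum) similarity with all other solutions of
    the temporary total population."""
    total = [s for p in populations for s in p]
    result = []
    for i, sol in enumerate(total):
        sims = [similarity(sol, o) for j, o in enumerate(total) if j != i]
        result.append((sum(sims) / len(sims), max(sims)))
    return result

def toy_similarity(a, b):
    """Stand-in similarity for numbers in [0, 1]."""
    return 1.0 - abs(a - b)

islands = [[0.0, 0.5], [0.5, 1.0]]
mp_values = mp_div(islands, toy_similarity)
sp_values = sp_div(islands, toy_similarity)
```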

Again, the multiplicative structural similarity approach (as defined in
Equations 9.63 to 9.66) has been used for estimating the similarity of model
structures; all coefficients c1 ... c10 were set to 0.2, only the coefficient
weighting the level difference contribution was set to 0.8.

In the following we summarize these values for all test runs by stating the
average values as well as standard deviations: Table 11.29 summarizes the
results of the test runs using the Thyroid data set, Table 11.30 those of the
test runs using the NOx data set.

Figure 11.27 exemplarily illustrates the multi-population diversity in a test
run of series (F) at iteration 50: The value represented in row i of column j in
bar k gives the average similarity of model i of population k with all formulas
stored in population j. Low multi-population similarity values are indicated
by light cells; dark cells represent high similarity values.


Table 11.28: GP test strategies.


Strategy                  | Properties
(A) Parallel standard GP  | Tournament parent selection
                          | (tournament size k = 3);
                          | Number of generations: 2000
(B) Parallel GP with OS   | Random & roulette parent selection
                          | Strict offspring selection (success ratio: 1.0,
                          | comparison factor: 1.0,
                          | maximum selection pressure: 200)
(C) Parallel standard GP, | Tournament parent selection
    1% migration          | (tournament size k = 3);
                          | Number of generations: 2000
                          | 1% best / worst replacement after
                          | every 100th generation
(D) Parallel GP with OS,  | Random & roulette parent selection
    1% migration          | Strict offspring selection (success ratio: 1.0,
                          | comparison factor: 1.0,
                          | maximum selection pressure: 200)
                          | 1% best / worst replacement after
                          | every 5th generation
(E) Parallel standard GP, | Tournament parent selection
    5% migration          | (tournament size k = 3);
                          | Number of generations: 2000
                          | 5% best / worst replacement after
                          | every 50th generation
(F) Parallel GP with OS,  | Random & roulette parent selection
    5% migration          | Strict offspring selection (success ratio: 1.0,
                          | comparison factor: 1.0,
                          | maximum selection pressure: 200)
                          | 5% best / worst replacement after
                          | every 5th generation
(G) SASEGASA              | Random & roulette parent selection
                          | Strict offspring selection (success ratio: 1.0,
                          | comparison factor: 1.0,
                          | maximum selection pressure: 200)


Table 11.29: Multi-population diversity test results of the GP test runs using
the Thyroid data set.

[Table body garbled in the source; the numeric entries are not legible.]

11.5.3 Discussion 


As we see in Tables 11.29 and 11.30, the average diversity among
populations in parallel island GP without interaction (i.e., in test series (A) and
(B)) rises up to values between 0.35 and 0.4, no matter whether or not OS is
applied; the maximum values eventually reach values between 0.45 and 0.5.
Considering all solutions collected in temporary total populations, as expected
the average similarities reach values below 0.4, while the maximum similarities
almost reach 1.0.

The similarity values monitored in test series (C) and (D) are, in
comparison to those of series (A) and (B), slightly higher, but not dramatically.
This does not hold for the next pair of test series (with 5% migration): The
similarity values calculated for test series (E) and (F) are significantly higher
than those of test series (A) to (D); in other words, the exchange of only 5%
of the populations’ models can lead to a significant decrease of population
diversity among populations of multi-population GP.

When using the SASEGASA, the diversity among populations is high in
the beginning and then steadily decreases as the algorithm is executed. This
is of course due to the reunification of populations as soon as the maximum
selection pressure is reached.


Table 11.30: Multi-population diversity test results of the GP test runs using
the NOx data set III.

[Table body garbled in the source; the numeric entries are not legible.]

By executing these test series and analyzing the results as given in this
section we have demonstrated how multi-population diversity can be
monitored using similarity measures such as those described in Section 9.4.2.
Reference values are given by parallel GP without migration; of course, the higher
the migration rates become, the more migration affects the diversity among GP
populations. When using the SASEGASA, rather high multi-population
specific diversity is given in the early stages of the parallel GP process, and due
to the merging of populations the diversity decreases and in the end reaches
diversity values comparable to those of single population GP with offspring
selection.


FIGURE 11.27: Exemplary multi-population diversity of a test run of Thyroid
series (F) at iteration 50, grayscale representation.


11.6 Code Bloat, Pruning, and Population Diversity


11.6.1 Introduction


In Chapter 2.6 we have described one of the major problems of genetic 
programming, namely permanent code growth, often also referred to as bloat; 
evolution is also seen as “survival of the fattest,” and, as Langdon and Poli 
expressed it, fitness-based selection leads to the fact that “fitness causes bloat” 
[LP97]. There are several approaches for combating this unwanted unlimited 
growth of chromosome size, some of them being 


• limiting the size and / or the height of the program trees,

• pruning programs, and

• punishing complex programs by decreasing their quality depending on
their respective tree representations’ size and / or height.


Of course, there is no optimal strategy for fixing formula size parameters, 
population size, or pruning strategies a priori (see also remarks in Chapter 2). 
Still, some code growth prevention strategies are surely more recommendable
than others; we here report on an exemplary test series for characterizing some
of the possible approaches.

In all other test series executed and reported on in other sections in this 
book we have used fixed complexity limits (limiting size and height of program 
trees); we shall here report on our tests regarding code growth in GP-based 
structure identification applying the pruning strategies presented in Section 
9.3.2 as well as structure tree size dependent fitness manipulation and fixed 
size limits (partially with additional pruning). All these approaches have been 
tested using standard GP as well as extended GP including gender specific 
selection and offspring selection. As an example, we have tested these GP
variants on the NOx data set II presented and described in Section 11.1.2.3;
population diversity, formula complexity parameters as well as additional pruning
effort (only in case of applying pruning, of course) have been monitored and
shall be reported on here.

We have again used 50% of the given data for training models (namely
samples 10,000 to 28,000), 10% as validation data (samples 28,001 to 32,000,
used by pruning strategies), and ~7.5% as test data (samples 32,001 to 35,000).
As we are also aware of the problem of overfitting, we have systematically
collected each GP run’s best models with respect to best fit on training as
well as on validation data (using the mse function for estimating the formulas’
qualities). The algorithm is designed to optimize formulas with respect to
training data; validation data are only used for pruning strategies (if used at
all). At the end of each test run, the models with best fit on training as well as
on validation data are analyzed, and in order to fight overfitting we select the
best model on validation data as the result returned by the algorithm. Test
data, which are not available to the algorithm, are used for demonstrating
that this strategy is a reasonable one: Analyzing the evaluation of the best
models on test data we see that those that are best on validation data perform
better on test data than those that were optimally fit to training data.
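
The selection of the returned model can be sketched as follows. The model names
and error values are made up for illustration, and the function name is an
assumption; the only point is that the result is chosen by validation error and
not by training error.

```python
def select_result_model(models, training_mse, validation_mse):
    """Return (best model on validation data, best model on training data)."""
    best_on_training = min(models, key=training_mse)
    best_on_validation = min(models, key=validation_mse)
    return best_on_validation, best_on_training

# Hypothetical error values of three collected models.
training_errors = {"m1": 1.2, "m2": 0.9, "m3": 1.0}
validation_errors = {"m1": 1.5, "m2": 2.4, "m3": 1.3}
result, best_fit_on_training = select_result_model(
    list(training_errors), training_errors.get, validation_errors.get)
```

In this toy case the model with the lowest training error ("m2") has the highest
validation error, which is the overfitting situation the strategy guards against.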

During the GP process, the standard mean squared error function was 
used; the time series specific fitness function considering plain values as well 
as differential and integral values was used for selecting those models that 
perform best on training and validation data. All three components (i.e., 
plain values, differentials, and integral values) have been weighted using equal 
weighting factors. When comparing the quality of the results documented in 
the following sections we again state the fitness values calculated using the 
mean squared errors function. 
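
A fitness function of this kind can be sketched as follows. The exact definition
used in the book is given elsewhere and is not reproduced here, so this is only
one plausible reading: differential values are taken as first differences,
integral values as running sums, and the three mean squared errors are combined
as a weighted average with equal weights by default.

```python
from itertools import accumulate

def _mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def _differences(series):
    """First differences as a simple stand-in for differential values."""
    return [b - a for a, b in zip(series, series[1:])]

def time_series_fitness(targets, estimates, weights=(1.0, 1.0, 1.0)):
    """Weighted average of the errors on plain, differential and integral values."""
    w_plain, w_diff, w_int = weights
    plain = _mse(targets, estimates)
    diff = _mse(_differences(targets), _differences(estimates))
    integral = _mse(list(accumulate(targets)), list(accumulate(estimates)))
    return (w_plain * plain + w_diff * diff + w_int * integral) / sum(weights)

fitness_value = time_series_fitness([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```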


11.6.2 Test Strategies 


In detail, the following test strategies have been applied: The parameters
for standard and extended GP are summarized in Table 11.31,
and the code growth prevention parameters are summarized in Table 11.32.
In all tests the initial population was created using a size limit of 50 nodes
and a maximum height of 6 levels for each structure tree.


Table 11.31: GP parameters used for code growth and bloat prevention tests.


Variant            | Parameters
1 (Standard GP,    | Population size: 1000
  SGP)             | 2000 generations
                   | Single point crossover; structural and
                   | parametric node mutation
                   | Parent selection: Tournament selection (k = 3)
2 (Extended GP,    | Population size: 1000
  EGP)             | Single point crossover; structural and
                   | parametric node mutation
                   | Parent selection: Gender specific selection
                   | (random & proportional)
                   | Strict offspring selection
                   | (maximum selection pressure: 100)


In the following table and in the explanations given afterwards, md is the
maximum deterioration limit and mc the maximum coefficient of deterioration
and structure complexity reduction as described in Section 9.3.2. For
ES-based pruning, mr denotes the maximum number of rounds, and mur the
maximum number of unsuccessful rounds.

In those tests including increased pruning (as applied in test series (h) 
and (i)) the initial pruning ratio is set to 0.3, i.e., in the beginning 30% of 
the population are pruned. Then, during the process execution, the pruning 
rate steadily increases and finally reaches 0.8; in standard GP runs, the rate 
is increased linearly, and in extended GP including offspring selection we 
compute the actual pruning ratio in relation to the actual selection pressure 
(so that in the end, when the selection pressure has reached its maximum 
value, the pruning rate has also reached its maximum, namely 0.8). 
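
The two pruning rate schedules can be sketched as follows. The linear schedule
for standard GP follows the description above; for extended GP the text only
states that the rate depends on the actual selection pressure and reaches 0.8 at
the maximum, so the linear interpolation between a selection pressure of 1 and
the maximum is an assumption, as are the function names.

```python
def pruning_rate_sgp(generation, max_generations, start=0.3, end=0.8):
    """Standard GP: pruning rate increases linearly with the generation index."""
    progress = min(max(generation / max_generations, 0.0), 1.0)
    return start + (end - start) * progress

def pruning_rate_egp(sel_pres, max_sel_pres, start=0.3, end=0.8):
    """Extended GP: pruning rate follows the actual selection pressure
    (assumed linear between a pressure of 1 and the maximum)."""
    progress = (sel_pres - 1.0) / (max_sel_pres - 1.0)
    progress = min(max(progress, 0.0), 1.0)
    return start + (end - start) * progress

halfway = pruning_rate_sgp(1000, 2000)   # rate in the middle of a 2000 generation run
```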

Furthermore, fs stands for the formula’s size (i.e., the number of nodes in
the corresponding structure tree), and pf is the fitness punishment factor:
If structure complexity based punishment is applied, then the fitness f of a
model is modified as f’ = f · (1 + pf) (if pf > 0).
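
As a small worked example of this modification (with a made-up punishment
factor; how pf is derived from the formula size fs is defined by the respective
test strategy and not by this sketch): a model with mean squared error 10.0 and
pf = 0.2 is assigned the punished fitness 12.0, whereas a model with pf ≤ 0 keeps
its fitness. The function name is illustrative.

```python
def punished_fitness(f, pf):
    """Apply structure complexity based punishment to an error-type fitness f."""
    return f * (1.0 + pf) if pf > 0 else f

punished = punished_fitness(10.0, 0.2)
```

Since the mean squared error is minimized, multiplying by a factor greater than 1
makes complex models appear worse.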


Table 11.32: Summary of the code growth prevention strategies applied in
these test series.


Variant | Characteristics
a       | No code growth prevention strategy
b       | 20% systematic pruning: md = 0, mc = 1
c       | 20% ES-based pruning: md = 0, mc = 1,
        | λ = 5, mr = 5, mur = 1
d       | 50% ES-based pruning: md = 0.5, mc = 1,
        | λ = 10, mr = 10, mur = 2
e       | 100% ES-based pruning: md = 2, mc = 1.5,
        | λ = 20, mr = 10, mur = 2
f       | Increasing ES-based pruning: md = 1, mc = 1.5,
        | λ = 10, mr = 10, mur = 2
g       | Quality punishment: pf = [expression garbled in the source]
h       | Fixed limits: Maximum tree height 6, maximum tree size 50
i       | Fixed limits: Maximum tree height 6, maximum tree size 50
        | combined with occasional ES-based pruning
        | (standard GP: every 5th, extended GP: every 2nd generation):
        | md = 1, mc = 1, λ = 10, mr = 5, mur = 2


Please note that in strategies (b) and (c) pruning is done after each
generation step, whereas in (d) to (g) it is done after each creation of a new model by
crossover and / or mutation. In standard GP this does not make any
difference, but when using offspring selection the decision whether to prune after
each creation or after each generation has major effects on the algorithmic
process.

The mean squared errors function (with early stopping, see Section 9.2.5.5) 
was used here since we mainly concentrate on pruning and population dynam- 
ics relevant aspects. Furthermore, all variables (including the target variable) 
were linearly scaled to the interval [-100; +100]. 
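
Such a linear scaling can be sketched as follows, assuming that the minimum and
the maximum of each variable are mapped to the interval bounds; the function name
and the handling of constant variables are our choices.

```python
def scale_linear(values, low=-100.0, high=100.0):
    """Map values linearly so that min(values) -> low and max(values) -> high."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant variable carries no range; place it in the middle.
        return [(low + high) / 2.0 for _ in values]
    factor = (high - low) / (v_max - v_min)
    return [low + (v - v_min) * factor for v in values]

scaled = scale_linear([0, 5, 10])
```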


11.6.3 Test Results 


Once again, all test strategies have been executed 5 times independently;
formula complexity has been monitored (and recorded after each generation
step) as well as structural population diversity, which was recorded after
every 10th generation: The multiplicative similarity approach (as defined in
Equations 9.63 to 9.66) has again been chosen; all coefficients c1 ... c10 were
set to 0.2, only the coefficient weighting the level difference contribution
was set to 0.8. The similarity of models was calculated symmetrically (as
described in Equation 6.4).


11.6.3.1 No Formula Size Limitation 


Exactly as we had expected, extreme code growth also occurs in GP-based
structure identification; Figure 11.28 illustrates the progress of formula
complexity in terms of formula size in exemplary test runs of series (1a) and (2a):
The average formula size is given as well as minimum and maximum values
and the progress of the best individual’s size.

As we see here, formulas tend to grow very big rather quickly; when using 
offspring selection, this effect is even a bit more obvious: On average, in 
standard GP the formula size has reached 212.84 after 30 iterations; when 
using OS the average formula size was even higher after 30 generations (namely 
276.35). 


11.6.3.2 Light Pruning 


The results of test series (b) and (c) can be summarized in the following
way: Without any further mechanisms that limit the structural complexity
of formula trees, light pruning as described in strategies (b) and (c) is not an
appropriate way to prevent GP from growing enormous formula structures.
After 100 generations, the average formula size in standard GP has grown
to 471.34 in test series (1b) and 333.65 in test runs of series (1c) (average
standard deviation: 204.29 and 238.27, respectively); in extended GP the
average formula size at generation 30 reached 293.26 and 276.12 in
test runs (2b) and (2c), the respective standard deviations being 157.23 and
124.80.

Systematically analyzing the results of the pruning phases performed in test
runs (b) and (c) we can compare the performances of ES-based and systematic
pruning. For this purpose we have collected the pruning performance statistics
for the tests (b) and (c) and summarize them in Table 11.33.


FIGURE 11.28: Code growth in GP without applying size limits or complexity
punishment strategies (left: standard GP, right: extended GP).


Table 11.33: Performance of systematic and ES-based pruning strategies.


Parameter                                      | Systematic pruning | ES-based pruning
Solutions evaluated for pruning one solution   | 161.02             | ?
Runtime consumed (per iteration)               | 31.27 sec          | 12.23 sec
Average coefficient of deterioration and
reduction of structural complexity             | 0.2495             | 0.4053

Obviously, both pruning methods performed approximately equally well
and were able to reduce the complexity of the formulas that were supposed to
be pruned. We also see that the runtime consumption is a lot higher when
using systematic pruning, especially for bigger model structures; moreover,
in the course of a GP process it is not considered necessary or even beneficial
to reduce models as much as possible. Therefore we shall concentrate on
ES-based pruning phases in the following test runs, and we suggest using
systematic pruning as a preparation step for results analysis, but not during
the execution of GP-based training processes.


11.6.3.3 Medium Pruning 


Medium pruning, as applied in test series (d), is in fact able to reduce the 
size of the formulas stored in the GP populations significantly. 


Table 11.34: Formula size progress in test series (d).


Test series | Iteration | Formula size (avg) | Formula size (std)
?           | 2000      | 168.23             | ?
?           | 10        | ?                  | ?
?           | 50        | ?                  | ?

Table 11.35: Quality of results produced in test series (d).


Test series | Evaluation data | Best model selection basis
            |                 | Training data (avg) | Training data (std) | Validation data (avg) | Validation data (std)
?           | ?               | 17,962.78           | 762.97              | 15,850.49             | ?
?           | Test data       | 7,162.48            | ?90.10              | 5,996.27              | ?
?           | Validation data | 14,590.83           | 1,470.25            | 10,500.30             | ?
?           | Test data       | 6,311.28            | 770.42              | 4,439.27              | ?


The best results obtained in the (d) test series are summarized in Table
11.35: For each test run we have collected the models with best fit on training
data as well as those that perform best on validation data; average values are
given as well as standard deviations. Obviously, rather strong overfitting has
happened here; as we had expected, the production of very large formulas
leads to over-fit formulas that are not able to perform well on samples that
were not used during the training phase.


11.6.3.4 Strong Pruning 


Rather strong pruning was applied in test series (e), and as we see in
Table 11.36, the formulas produced by GP are significantly smaller than those
produced in the previous test series. Still, we observed that genetic
diversity is lost very quickly: Already in early stages of the evolutionary
processes, the average structural similarity of solutions reaches a very high
level (which is documented in the two rightmost columns of Table 11.36).

The quality of the best models produced is very bad (above 5,000), which
is why we do not state any further details here about the evaluation of these
models on the given data partitions. We suppose that this low result quality
is connected to the loss of population diversity (and of course also to the
fact that the pruning operations applied were allowed to decrease the models’
quality).


Table 11.36: Formula size and population diversity progress in test series (e).


Test series | Iteration  | Formula size (avg) | Formula size (std) | Solutions similarity (avg) | Solutions similarity (std)
(1e)        | 50         | ?                  | ?                  | 0.8919                     | ?
            | ?          | ?                  | ?                  | ?                          | 0.0289
            | 500        | 19.75              | 23.52              | 0.9685                     | 0.0187
            | 2000       | 21.39              | 20.87              | 0.987                      | 0.0095
(2e)        | ?          | ?                  | ?                  | ?                          | 0.0318
            | 20         | 19.86              | 10.83              | 0.9835                     | 0.0217
            | ?          | 21.64              | 16.31              | 0.9921                     | 0.0082
            | End of run | 20.03              | 18.27              | 0.9943                     | 0.0093


11.6.3.5 Increased Pruning 


As light, medium, and strong pruning did not lead to the desired results,
we have also tried increasing pruning as defined in test strategy (f). As we see
in Table 11.37, this strategy performs rather well: The size of the formulas
produced by GP rises especially in early stages of the GP process, but then
decreases and on average finally reaches values between 80 and 100.

In addition to this, the population diversity stays higher in the beginning
than in GP tests including constantly strong pruning, but eventually decreases
and the solutions finally show higher similarities due to the increased pruning
in later algorithmic stages.


Table 11.37: Formula size and population diversity progress in test series (f).


Test series | Iteration  | Formula size (avg) | Formula size (std) | Solutions similarity (avg) | Solutions similarity (std)
(1f)        | ?          | ?                  | 95.76              | ?                          | 0.0943
            | ?          | ?                  | ?                  | ?                          | 0.1059
            | 50         | 92.45              | 107.41             | 0.6820                     | 0.1124
            | 2000       | 87.02              | 90.68              | 0.8035                     | 0.0861
(2f)        | ?          | ?                  | ?                  | ?                          | 0.0612
            | 20         | 63.59              | 59.37              | 0.7052                     | ?
            | ?          | 80.26              | 40.99              | 0.950                      | 0.0588
            | End of run | 79.45              | 47.67              | 0.9907                     | 0.0156


The quality values of the results produced in this test series are summarized 
in Table 11.38. Obviously, less overfitting has happened than in the tests with 
light or medium pruning. 


Table 11.38: Quality of results produced in test series (f).


Test series | Evaluation data | Best model selection basis
            |                 | Training data (avg) | Training data (std) | Validation data (avg) | Validation data (std)
?           | Validation data | 8,901.91            | 611.02              | 5,981.52              | ?
?           | Test data       | 3,786.51            | 800.38              | 2,830.78              | ?
?           | ?               | 2,275.24            | 619.11              | 3,811.93              | ?
?           | ?               | 2,597.35            | 512.04              | 7,781.28              | ?
?           | Validation data | 9,712.98            | 767.56              | 5,302.62              | ?
?           | Test data       | 4,912.38            | 1,198.58            | 2,275.03              | ?

(A further entry, 827.83, cannot be assigned to a cell.)


11.6.3.6 Complexity Dependent Quality Punishment 


In fact, our GP test runs including complexity dependent quality punishment,
i.e., those of test strategy (g), were also able to produce acceptable
results for the NOx data set investigated here. As we see in Table 11.39,
in standard GP the formula sizes are rather high in the beginning and then
decrease steadily, whereas in GP with offspring selection the models on
average include between 50 and 60 nodes during the whole execution of the GP
processes. Population diversity values are comparable to those reported for
GP tests without pruning or quality dependent punishment as summarized
for example in Section 11.4.
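
The idea of complexity dependent quality punishment can be illustrated with a minimal sketch. The punishment function actually used in test strategy (g) is defined earlier in this chapter and is not reproduced here; the linear form below, as well as the names `free_size` and `penalty_per_node` and their default values, are purely hypothetical.

```python
def punished_quality(mse: float,
                     formula_size: int,
                     free_size: int = 50,
                     penalty_per_node: float = 0.01) -> float:
    """Increase the error (to be minimized) of formulas exceeding `free_size`.

    Hypothetical linear scheme: every node beyond `free_size` raises the
    error by the fraction `penalty_per_node` of the unpunished error.
    """
    excess = max(0, formula_size - free_size)
    return mse * (1.0 + penalty_per_node * excess)
```

With such a scheme a large formula survives selection only if its fit is better by more than its penalty, which pushes the population toward smaller models without forbidding large ones outright.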

Figure 11.29 illustrates the formula complexity progress of an exemplary GP 
run of test series (2g). The qualities of the models with best fit on training 
and validation are summarized in Table 11.40. 


Table 11.39: Formula size and population diversity progress in test series (g).


Test series | Iteration  | Formula size (avg) | Formula size (std) | Solutions similarity (avg) | Solutions similarity (std)
(1g)        | 50         | 140.76             | 90.75              | 0.3824                     | ?
            | 100        | ?                  | ?                  | 0.3916                     | ?
            | 300        | 73.75              | 64.99              | 0.6387                     | ?
            | 2000       | 79.07              | ?                  | 0.7202                     | ?
(2g)        | 10         | ?                  | 64.67              | 0.4873                     | ?
            | ?          | ?                  | ?                  | ?                          | ?
            | ?          | ?                  | 48.33              | 0.8007                     | ?
            | End of run | 58.82              | 41.87              | 0.9315                     | ?

Table 11.40: Quality of results produced in test series (g).


Test series | Evaluation data | Best model selection basis
            |                 | Training data (avg) | Training data (std) | Validation data (avg) | Validation data (std)
(1g)        | Training data   | 1,837.84            | 526.10              | 4,729.42              | ?
?           | ?               | 12,902.67           | 767.35              | 4,531.73              | ?
?           | ?               | 2,597.73            | 835.41              | 2,708.36              | ?
?           | Validation data | 9,345.87            | 738.60              | 3,949.64              | ?
?           | ?               | 1,102.19            | 593.81              | 3,121.80              | ?
?           | ?               | 3,853.62            | 812.51              | 2,618.94              | ?


FIGURE 11.29: Progress of formula complexity in one of the test runs of
series (1g), shown for the first ~400 iterations.


11.6.3.7 Fixed Size Limits 


In the case of fixed size limits the crossover and mutation operators have to
respect limits for the complexity of models. Model size and population
diversity statistics for test series (h) are summarized in Table 11.41; in GP
with offspring selection all formulas eventually reach the maximum size, and
the solutions similarity values are comparable to those reported in Section
11.4. Table 11.42 summarizes the quality of the results produced, again
evaluated on training, validation, and test data. Figure 11.30 illustrates the
formula complexity progress of exemplary GP test runs of series (1h) and (2h).
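
One way in which a crossover operator can respect a fixed size limit is sketched below. This is an assumption-laden illustration, not the operator used in the tests: the `Node` class, the retry scheme, and the fallback of returning a copy of the first parent are all hypothetical.

```python
import copy
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical formula tree node."""
    symbol: str
    children: list["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the tree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def all_nodes(node: Node) -> list[Node]:
    """Return every node of the tree in pre-order."""
    nodes = [node]
    for child in node.children:
        nodes.extend(all_nodes(child))
    return nodes


def size_limited_crossover(mother: Node, father: Node, max_size: int,
                           rng: random.Random, max_tries: int = 10) -> Node:
    """Subtree crossover that rejects children larger than `max_size`."""
    for _ in range(max_tries):
        child = copy.deepcopy(mother)
        target = rng.choice(all_nodes(child))
        donor = copy.deepcopy(rng.choice(all_nodes(father)))
        # Overwrite the chosen node in place with the donor subtree.
        target.symbol, target.children = donor.symbol, donor.children
        if size(child) <= max_size:
            return child
    # Assumed fallback: an unchanged copy of the first parent, which
    # presupposes that this parent itself respects the limit.
    return copy.deepcopy(mother)
```

Because every accepted child satisfies the limit and the fallback returns a parent that already does, the population can never exceed `max_size`, which matches the saturation at the maximum size observed for GP with offspring selection.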


Table 11.41: Formula size and population diversity progress in test series (h).


Test series | Iteration  | Formula size (avg) | Formula size (std) | Solutions similarity (avg) | Solutions similarity (std)
?           | 50         | ?                  | ?                  | 0.8907                     | ?
?           | End of run | 50.0000            | 0.0000             | 0.9751                     | ?


Table 11.42: Quality of results produced in test series (h).


Test series | Evaluation data | Best model selection basis
            |                 | Training data (avg) | Training data (std) | Validation data (avg) | Validation data (std)
?           | Validation data | 10,801.77           | 923.04              | 4,248.37              | ?
?           | Test data       | 5,791.25            | 1,206.51            | 2,610.64              | ?
?           | ?               | 7,508.12            | 382.04              | 3,083.64              | ?
?           | ?               | ?774.94             | 300.51              | 4,108.30              | ?
?           | Validation data | 9,641.89            | 833.71              | 3,738.13              | ?
?           | Test data       | 4,802.30            | 1,371.22            | 1,374.61              | ?


FIGURE 11.30: Progress of formula complexity in one of the test runs of 
series (1h) (shown left) and one of series (2h) (shown right). 


In addition to total statistics we shall also discuss two selected models
returned by one of the test runs of series (2h): Model b_t is the model that
performs best on training data (shown in Figure 11.31), b_v the one that
performs best on validation data (shown in Figure 11.32). The error
distributions on training, validation, and test data partitions are illustrated
in Figure 11.33. Table 11.43 characterizes the performance of b_t and b_v by
means of mean squared errors as well as the integral values. For this we have
calculated the sum of the target values on training, validation, and test data
and compared these integral values to those calculated using the models under
investigation. Obviously, b_t shows a better integral fit on training (and also
validation) data, but when it comes to test data, the model that performed best
on validation data (b_v) produces much more satisfying results (with an
integral error of only 2.354% on test data).
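
The integral comparison described above can be written down in a few lines. The sketch assumes that the integral error is the relative deviation of the summed model outputs from the summed target values, expressed in percent, and that a positive sign means overestimation; both the function name and this sign convention are assumptions.

```python
from typing import Sequence


def integral_error_percent(targets: Sequence[float],
                           estimates: Sequence[float]) -> float:
    """Relative deviation (in percent) of the summed estimates from the summed targets."""
    target_sum = sum(targets)
    if target_sum == 0:
        raise ValueError("the sum of the target values must not be zero")
    return 100.0 * (sum(estimates) - target_sum) / target_sum
```

For example, target values summing to 60 and model outputs summing to 61.5 give an integral error of +2.5%. A small integral error does not imply a small mean squared error, since positive and negative deviations cancel in the sum.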


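Table 11.43: Comparison of best models on training and validation data (b_t
and b_v, respectively).


Parameter                | b_t | b_v
Training quality (MSE)   | ?   | ?
Validation quality (MSE) | ?   | ?
Test quality (MSE)       | ?   | ?

Only fragments of the cell values are legible: 6.037 x 10^?, -0.452%,
?86 x 10^?, and +2.354% (the integral error of b_v on test data).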


11.6.3.8 Fixed Size Limits and Occasional Pruning 


Finally, test series with fixed size limits and occasional pruning have also 
been executed and analyzed; the results regarding formula complexity, pop- 
ulation diversity, and results qualities are summarized in Tables 11.44 and 
11.45. 


Obviously, the results produced are (with respect to evaluation quality)
comparable to those produced in the previous series. Still, the formula
sizes are of course a bit smaller (due to pruning), and overfitting also seems
to have decreased: Even though the fit on training data is not as good as in
previous test series, the quality on test data is still very good and comparable
to the test performance reached in test series (g) and (h).


FIGURE 11.31: Model with best fit on training data: Model structure and
full evaluation.


FIGURE 11.32: Model with best fit on validation data: Model structure and
full evaluation.


11.6.4 Conclusion 


In this section we have demonstrated the effects of code bloat and selected
prevention strategies for GP. As expected and known from the literature,
without any limitations or size reducing strategies GP tends to produce bigger
and bigger models that fit the given training data, but of course this also
increases the probability of producing over-fit models. Pruning strategies have
been analyzed, and the test results show that only strong pruning is able to
prevent GP from producing bigger and bigger models, which in turn decreases
population diversity and leads to results which are not optimal. Complexity
dependent fitness punishment as well as fixed size limits enable GP to produce
quite good results; occasional pruning in combination with fixed size limits
can help to decrease overfitting.


FIGURE 11.33: Error distributions of best models: Charts I, II, and III show
the error distributions of the model with best fit on training data evaluated
on training, validation, and test data, respectively; charts IV, V, and VI show
the error distributions of the model with best fit on validation data evaluated
on training, validation, and test data, respectively.


Table 11.44: Formula size and population diversity progress in test series (i).


Test series | Iteration  | Formula size (avg) | Formula size (std) | Solutions similarity (avg) | Solutions similarity (std)
(1i)        | ?          | ?                  | ?                  | ?                          | 0.0852
            | ?          | 37.1865            | ?                  | ?                          | 0.0711
            | 500        | 39.2217            | 5.2075             | 0.8388                     | 0.0450
            | 2000       | 40.1360            | ?                  | ?                          | 0.0251
(2i)        | 10         | 18.5380            | 6.0114             | ?                          | 0.0518
            | 20         | 21.5280            | 5.3083             | 0.7202                     | 0.0772
            | 50         | 38.5143            | 5.0305             | 0.9248                     | 0.0403
            | End of run | 48.2051            | 4.0228             | 0.9859                     | 0.0178


Table 11.45: Quality of results produced in test series (i).


Test series | Evaluation data | Best model selection basis
            |                 | Training data (avg) | Training data (std) | Validation data (avg) | Validation data (std)
(1i)        | Training data   | 2,258.22            | 501.27              | 5,809.40              | ?
            | Validation data | 6,608.26            | 1,403.49            | 4,819.20              | ?
            | Test data       | 2,238.61            | 983.57              | 1,811.05              | ?
(2i)        | Training data   | 1,723.07            | 623.11              | 4,209.57              | ?
            | Validation data | 6,301.46            | 921.26              | 3,007.13              | ?
            | Test data       | 3,289.33            | 945.79              | 1,434.63              | ?


Conclusion and Outlook 


In this book we have discussed basic principles as well as algorithmic improve- 
ments in the context of genetic algorithms (GAs) and genetic programming 
(GP); new problem independent theoretical concepts have been described 
which are used in order to substantially increase achievable solution quali- 
ties. The application of these concepts to significant combinatorial optimiza- 
tion problems as well as structure identification in time series analysis and 
classification has also been described. 


We have presented enhanced concepts for GAs, which enable a self-adaptive 
interplay of selection and solution manipulation operators. By using these 
concepts we want to avoid the disappearance and support the combination 
of alleles from the gene pool that represent solution properties of highly fit 
individuals (introduced as relevant genetic information). As we have shown 
in several test series, relevant genetic information is often lost in conventional 
implementations of GAs and GP; if this happens, it can only be reintroduced 
into the population’s gene pool by mutation. This dependence on mutation 
can be reduced by using generic selection principles such as offspring selec- 
tion (which is also used in the SASEGASA) or self-adaptive population size 
adjustment (as used by the RAPGA). The survival of essential genetic infor- 
mation by supporting the survival of relevant alleles rather than the survival 
of above-average chromosomes is the main goal of both these approaches. 
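
As a reminder of how offspring selection steers the survival of relevant alleles, a strongly simplified sketch of one generation step follows. It assumes a minimization problem, purely random parent selection, and a pool of unsuccessful children that fills the remaining slots; these simplifications and all function and parameter names are assumptions, and the definition given earlier in this book remains authoritative.

```python
import random


def is_successful(child_quality, parent_qualities, comparison_factor):
    """Success criterion for minimization.

    comparison_factor = 0: the child must beat the worse parent;
    comparison_factor = 1: the child must beat the better parent.
    """
    better, worse = min(parent_qualities), max(parent_qualities)
    threshold = worse + comparison_factor * (better - worse)
    return child_quality < threshold


def offspring_selection_step(population, evaluate, crossover, mutate, rng,
                             success_ratio=1.0, comparison_factor=1.0,
                             max_selection_pressure=50.0):
    """Create one new generation; returns (new_population, selection_pressure)."""
    pop_size = len(population)
    needed = round(success_ratio * pop_size)
    max_children = int(max_selection_pressure * pop_size)
    successful, pool, created = [], [], 0
    while len(successful) < needed and created < max_children:
        mother, father = rng.choice(population), rng.choice(population)
        child = mutate(crossover(mother, father, rng), rng)
        created += 1
        parent_qualities = (evaluate(mother), evaluate(father))
        if is_successful(evaluate(child), parent_qualities, comparison_factor):
            successful.append(child)
        else:
            pool.append(child)
    # Fill the remaining slots from the pool of unsuccessful children and,
    # if the pool is too small, with copies of old individuals (assumption).
    fill = pool[:pop_size - len(successful)]
    while len(successful) + len(fill) < pop_size:
        fill.append(rng.choice(population))
    return successful + fill, created / pop_size
```

The second return value is the number of children created per population slot; hitting the `max_selection_pressure` limit indicates that successful offspring can hardly be produced any more and can serve as a termination signal.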


In the empirical part of this book we have documented and discussed our 
experiences in applying these new algorithmic concepts to benchmark as well 
as real world problems. Concretely, we have used traveling salesman prob- 
lems as well as vehicle routing problems as representatives of combinatorial 
optimization problems; time series and classification analysis problems have 
been used as application areas of data-based structure identification with ge- 
netic programming. We have compared the results achievable with standard 
implementations of GAs and GP to the results achieved using extended al- 
gorithmic concepts that do not depend on a concrete problem representation 
and its operators; the influences of the new concepts on population dynamics 
in GA and GP populations have also been analyzed. 
Nevertheless, there is still a lot of work to be done in the context of the
research areas we have dealt with in this book. Furthermore, there are a lot
of potential synergies which have to be considered and should be explored.


• The most important aspect is the following one: As the enhanced
algorithmic concepts discussed in this book are problem independent, they
can be applied to any kind of optimization problem which can be tackled
by a GA or GP. Of course, there are numerous kinds of optimization
problems besides traveling salesman and vehicle routing problems which
can be solved successfully by genetic algorithms; regarding GP we have
up to now more or less only gained experience in using offspring
selection in the context of data-based modeling, but there is a huge variety
of other problems to which the approaches discussed in this book should
also be applied.


• HeuristicLab (HL) is our environment for developing and testing
optimization methods, tuning parameters, and solving a multitude of
problems. The development of HL was started in 2002 and has meanwhile led
to a stable and productive optimization platform; it is continuously
enhanced and a topic of several publications ([WA04c], [WA04a], [WA04b],
[WA05a], and [WWB+07]). On the respective website¹³ the interested
reader can find information about the design of HeuristicLab, its
development over the years, installable software packages, documentation,
and publications in the context of HeuristicLab and the research group
HEAL¹⁴.


One of the most beneficial aspects of HeuristicLab is its plug-in based 
architecture. In software engineering in general, plug-in based software 
systems have become very popular; by not only splitting the source code 
into different modules, but compiling these modules into enclosed ready- 
to-use software building blocks, the development of a whole application 
or complex software system is reduced to the task of selecting, combin- 
ing, and distributing the appropriate modules. Due to the support of 
dynamic location and loading techniques offered in modern application 
frameworks as for example Java or .NET, the modules do not need to 
be statically linked during compilation, but can be dynamically loaded 
at runtime. Thus, the core application can be enriched by adding these 
building blocks, which are therefore called “plug-ins” as they are addi- 
tionally plugged into the program. 


Several problem representations, solution encodings, and numerous
algorithmic concepts have so far been developed for HeuristicLab, realizing
a large number of heuristic and evolutionary algorithms (genetic
algorithms, genetic programming, evolution strategies, tabu search, etc.) for
a wide range of problem classes including the traveling salesman
problem, the vehicle routing problem, real-valued test functions in different
dimensions, and, last but not least, also data-based modeling.


¹³ http://www.heuristiclab.com.
¹⁴ Heuristic and Evolutionary Algorithms Laboratory, Linz / Hagenberg, Austria.


Still, not only the software platform itself is flexible and extensible, also 
the algorithms provided in HL are (since version 2.0) not fixed and hard- 
coded, but can be parameterized and even designed by the user. This is 
possible by realizing all solution generating and processing procedures 
as operators working on single solutions or sets of solutions. 


By providing a set of plug-ins, each realizing a specific solution represen- 
tation or operation, the process of developing new heuristic algorithms 
is revolutionized. Algorithms do not need to be programmed anymore, 
but can be created by combining operators of different plug-ins. This 
approach has a huge advantage: By providing a graphical user inter- 
face for selecting plug-ins and combining operators, no programming or 
software engineering skills are necessary for this process. As a conse- 
quence, algorithms can be modified, tuned, or developed by experts of 
different fields with little or even no knowledge in the field of software 
development. 



FIGURE 11.34: A simple workbench in HeuristicLab 2.0. 


In Figure 11.34 we show a screenshot of a simple HeuristicLab workbench
(version HL 2.0): A structure identification problem is solved by
a GP algorithm using offspring selection. All relevant parts of the
algorithm (as for example population initialization, crossover, generational
replacement, and offspring selection) can be seen in the left part of the
workbench GUI; these parts can be easily rearranged or replaced by
users who are not necessarily experts in heuristic optimization or even
computer science. Thus, we want to transfer algorithm development
competence from experts in heuristic optimization to users working on
concrete applications; users who work in domains other than heuristic
optimization will thus no longer have to use heuristics as black box
techniques (as is frequently done nowadays), but can use them as
algorithms which can be modified and easily tuned to specific problem
situations.


One of our current research interests is to combine agent-based software
development techniques with heuristic optimization methods. Here
again, genetic programming is one of the fields that could profit from the
intrinsic parallelization of software agents and, at the same time, gain in
the quality and expressiveness of the models found. Agents could
be programmed to identify different variables in the given data sets and 
examine a broader range of correlations. Each of these agents repre- 
sents a GP process evolving a population of formulas (models); at given 
synchronization points, these agents exchange information among each 
other. Unlike other parallel GP approaches, in which parts of popula- 
tions are exchanged which in principle all have the same goal (namely 
to solve a given identification task), we here want to establish an in- 
formation exchange mechanism by which partial information about re- 
lationships in the data is passed and shared among the identification 
agents. 

The probably most important goal of such a parallel GP approach is 
to develop an automated mechanism that can identify not only singular 
relationships in data, but rather whole information networks that de- 
scribe lots of relationships that can be found. This incorporates the use 
of GP agents that aim to identify models for different target variables. 
So it should become possible to identify classes of equivalent models 
that differ only in the way certain input variables are described; these 
results will hopefully help to find answers to one of the most important 
questions in system identification, namely which of the potential models 
are best suited for further theoretical analyses. 
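
The two architectural ideas described for HeuristicLab above, namely building blocks that are located and loaded at runtime and algorithms that are assembled from operators working on sets of solutions, can be illustrated with a small sketch. It uses Python's standard `importlib` module and does not reflect HeuristicLab's actual plug-in interface; `load_operator`, `compose`, and `keep_best` are hypothetical names, and the built-in function `sorted` merely stands in for an operator provided by a plug-in.

```python
import importlib
from typing import Callable, Sequence

# An operator maps a list of solutions to a (possibly different) list.
Operator = Callable[[list], list]


def load_operator(module_name: str, attribute: str) -> Operator:
    """Resolve a callable by module and attribute name at runtime."""
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


def compose(operators: Sequence[Operator]) -> Operator:
    """Build an algorithm by applying the given operators one after another."""
    def algorithm(solutions: list) -> list:
        for operator in operators:
            solutions = operator(solutions)
        return solutions
    return algorithm


def keep_best(count: int) -> Operator:
    """Operator that keeps the first `count` solutions of a sorted list."""
    return lambda solutions: solutions[:count]


# Stand-in for a plug-in: the built-in `sorted`, looked up by name.
sort_by_quality = load_operator("builtins", "sorted")
pipeline = compose([sort_by_quality, keep_best(2)])
```

Here the solutions are represented simply by their quality values (smaller is better). Since the pipeline is an ordinary list of operators, an operator can be replaced or the sequence rearranged without touching the code of any other operator, which is the property a graphical workbench can exploit.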


We hope that one of the results of this book will be an increased interest 
in population dynamics analysis as well as generic algorithmic developments 
as for example enhanced selection methods for evolutionary algorithms. By 
showing the general applicability of enhanced selection concepts in GAs ap- 
plied to combinatorial problems as well as in GP, we hope that we have been 
able to inspire readers to apply these concepts to other problems as well as to 
include them in other variants of evolutionary algorithms. 


Symbols and Abbreviations 


Symbol        Description

ANN           Artificial neural network
AUC           Area under a ROC curve
CGA           Canonical genetic algorithm
(C)VRP(TW)    (Capacitated) vehicle routing problem (with time windows)
CX            Cyclic crossover for the TSP
EA            Evolutionary algorithm
ERX           Edge recombination crossover for the TSP
ES            Evolution strategy
GA            Genetic algorithm
GP            Genetic programming
HL            HeuristicLab
kNN           k-nearest neighbor algorithm
NOx           Nitric oxides
OS            Offspring selection
OX            Order crossover for the TSP
PMX           Partially matched crossover for the TSP
RAPGA         Relevant alleles preserving genetic algorithm
RBX           Route-based crossover for the CVRP
ROC           Receiver operating characteristic
SASEGASA      Self-adaptive segregative genetic algorithm including
              aspects of simulated annealing
SBX           Sequence-based crossover for the CVRP
SGA           Standard genetic algorithm
TS            Tabu search
TSP           Traveling salesman problem